BIALLELIC MARKERS FOR USE IN CONSTRUCTING A HIGH
DENSITY DISEQUILIBRIUM MAP OF THE HUMAN GENOME 



Background of the Invention

Recent advances in genetic engineering and bioinformatics have enabled the 
manipulation and characterization of large portions of the human genome. While efforts to 
obtain the full sequence of the human genome are rapidly progressing, there are many 
practical uses for genetic information which can be implemented with partial knowledge of 
the sequence of the human genome.

As the full sequence of the human genome is assembled, the partial sequence 
information available can be used to identify genes responsible for detectable human traits, 
such as genes associated with human diseases, and to develop diagnostic tests capable of 
identifying individuals who express a detectable trait as the result of a specific genotype or 
individuals whose genotype places them at risk of developing a detectable trait at a subsequent
time. Each of these applications for partial genomic sequence information is based upon the 
assembly of genetic and physical maps which order the known genomic sequences along the 
human chromosomes. 

The present invention relates to an ordered set of human genomic sequences 
comprising single nucleotide polymorphisms, as well as the use of these polymorphisms as a
high resolution map of the human genome, methods of identifying genes associated with 
detectable human traits, and diagnostics for identifying individuals who carry a gene which 
causes them to express a detectable trait or which places them at risk of expressing a 
detectable trait in the future. 


Advantages of the biallelic markers of the present invention 

The map-related biallelic markers of the present invention offer a number of important
advantages over other genetic markers such as RFLP (Restriction Fragment Length
Polymorphism) markers, VNTR (Variable Number of Tandem Repeats) markers and earlier
STS (sequence tagged site)-derived markers.

The first generation of markers were RFLPs, which are variations that modify the
length of a restriction fragment. However, the methods used to identify and to type RFLPs are
relatively wasteful of materials, effort, and time. Since they are biallelic markers (they present
only two alleles, the restriction site being either present or absent), their maximum
heterozygosity is 0.5. The theoretical number of RFLPs distributed along the entire human
genome is more than 10^5, which leads to a potential average inter-marker distance of 30
kilobases. However, in reality the number of evenly distributed RFLPs which occur at a
sufficient frequency in the population to make them useful for tracking of genetic
polymorphisms is very limited.

The second generation of genetic markers were VNTRs, which can be categorized as
either minisatellites or microsatellites. Minisatellites are tandemly repeated DNA sequences
present in units of 5-50 repeats which are distributed along regions of the human
chromosomes ranging from 0.1 to 20 kilobases in length. Since they present many possible
alleles, their informative content is very high. Minisatellites are scored by performing
Southern blots to identify the number of tandem repeats present in a nucleic acid sample from
the individual being tested. However, there are only 10^4 potential VNTRs that can be typed
by Southern blotting. Thus, the number of easily typed informative markers in these maps is
far too small for the average distance between informative markers to fulfill the requirements
for a useful genetic map. Moreover, both RFLP and VNTR markers are costly and
time-consuming to develop and assay in large numbers.

Initial attempts to construct genetic maps based on non-RFLP biallelic markers have
focused on identifying biallelic markers lying within sequence tagged sites (STSs), pieces of
genomic DNA having a known sequence and averaging about 250 bases in length. More than
30,000 STSs have been identified and ordered along the genome (Hudson et al., Science
270:1945-1954 (1995); Schuler et al., Science 274:540-546 (1996)). For example, the
Whitehead Institute and Genethon's integrated map contains 15,086 STSs.

These sequence tagged sites can be screened to identify polymorphisms, preferably
Single Nucleotide Polymorphisms (SNPs), more preferably non-RFLP biallelic markers
therein. Generally, polymorphisms are identified by determining the sequence of the STSs in 5
to 10 individuals.

Wang et al. (Cold Spring Harbor Laboratory: Abstracts of papers presented on Genome
Mapping and Sequencing p.17 (May 14-18, 1997)) recently announced the identification and
mapping of 750 Single Nucleotide Polymorphisms resulting from the sequencing of 12,000
STSs from the Whitehead/MIT map, in eight unrelated individuals. The map was assembled
using a high throughput system based on the utilization of DNA chip technology available
from Affymetrix (Chee et al., Science 274:610-614 (1996)).

However, according to experimental data and statistical calculations, fewer than one out
of 10 of all STSs mapped today will contain an informative Single Nucleotide Polymorphism.
This is primarily due to the short length of existing STSs (usually less than 250 bp). If one
assumes 10^6 informative SNPs spread along the human genome, there would on average be
one marker of interest every 3x10^9/10^6 bp, i.e. every 3,000 bp. The probability that one such
marker is present on a 250 bp stretch is thus less than 1/10.

While it could produce a high density map, the STS approach based on currently
existing markers does not put any systematic effort into making sure that the markers obtained
are optimally distributed throughout the entire genome. Instead, polymorphisms are limited to
those locations for which STSs are available.
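The spacing arithmetic used in this section can be checked with a short sketch. It assumes, as the text does, a genome of about 3x10^9 base pairs and markers spread evenly along it; the function names are illustrative and are not part of the disclosure. The expected number of markers on a fragment is an upper bound on the probability that the fragment carries at least one marker.

```python
GENOME_BP = 3e9  # approximate human genome size assumed in the text


def mean_spacing_bp(n_markers, genome_bp=GENOME_BP):
    """Average distance between markers if they are spread evenly."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return genome_bp / n_markers


def expected_markers_per_fragment(fragment_bp, n_markers, genome_bp=GENOME_BP):
    """Expected number of markers falling on a fragment of the given length."""
    return fragment_bp / mean_spacing_bp(n_markers, genome_bp)


# 10^6 informative SNPs: one marker every 3,000 bp on average.
snp_spacing = mean_spacing_bp(1e6)
# A 250 bp STS then carries fewer than 0.1 informative SNPs on average.
snps_per_sts = expected_markers_per_fragment(250, 1e6)
# 10^5 RFLPs: an average inter-marker distance of 30 kilobases.
rflp_spacing = mean_spacing_bp(1e5)
```

Under the same even-spacing assumption, one marker every 30 kb corresponds to about 100,000 markers genome-wide.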
The even distribution of markers along the chromosomes is critical to the future
success of genetic analyses. In particular, a high density map having appropriately spaced
markers is essential for conducting association studies on sporadic cases, aiming at identifying
genes responsible for detectable traits such as those which are described below.

As will be further explained below, genetic studies have mostly relied in the past on a
statistical approach called linkage analysis, which took advantage of microsatellite markers to
study their inheritance pattern within families from which a sufficient number of individuals
presented the studied trait. Because of intrinsic limitations of linkage analysis, which will be
further detailed below, and because these studies necessitate the recruitment of adequate
family pedigrees, they are not well suited to the genetic analysis of all traits, particularly those
for which only sporadic cases are available (e.g. drug response traits), or those which have a
low penetrance within the studied population.

Association studies enabled by the biallelic markers of the present invention offer an
alternative to linkage analysis. Combined with the use of a high density map of appropriately
spaced, sufficiently informative markers, association studies, including linkage
disequilibrium-based genome wide association studies, will enable the identification of most
genes involved in complex traits.

Single nucleotide polymorphisms or biallelic markers can be used in the same manner
as RFLPs and VNTRs but offer several advantages. Single nucleotide polymorphisms are
densely spaced in the human genome and represent the most frequent type of variation. An
estimated number of more than 10^7 sites are scattered along the 3x10^9 base pairs of the
human genome. Therefore, single nucleotide polymorphisms occur at a greater frequency and
with greater uniformity than RFLP or VNTR markers, which means that there is a greater
probability that such a marker will be found in close proximity to a genetic locus of interest.
Single nucleotide polymorphisms are less variable than VNTR markers but are mutationally
more stable.

Also, the different forms of a characterized single nucleotide polymorphism, such as
the biallelic markers of the present invention, are often easier to distinguish and can therefore
be typed easily on a routine basis. Biallelic markers have single nucleotide based alleles and
they have only two common alleles, which allows highly parallel detection and automated
scoring. The biallelic markers of the present invention offer the possibility of rapid,
high-throughput genotyping of a large number of individuals.



WO 99/54500 



4 



PCT/IB99/00822 



Biallelic markers are densely spaced in the genome, sufficiently informative and can
be assayed in large numbers. The combined effects of these advantages make biallelic
markers extremely valuable in genetic studies. Biallelic markers can be used in linkage
studies in families, in allele sharing methods, in linkage disequilibrium studies in populations,
and in association studies of case-control populations. An important aspect of the present
invention is that biallelic markers allow association studies to be performed to identify genes
involved in complex traits. Association studies examine the frequency of marker alleles in
unrelated case- and control-populations and are generally employed in the detection of
polygenic or sporadic traits. Association studies may be conducted within the general
population and are not limited to studies performed on related individuals in affected families
(linkage studies). Biallelic markers in different genes can be screened in parallel for direct
association with disease or response to a treatment. This multiple gene approach is a powerful
tool for a variety of human genetic studies as it provides the necessary statistical power to
examine the synergistic effect of multiple genetic factors on a particular phenotype, drug
response, sporadic trait, or disease state with a complex genetic etiology.

The present invention relates to high density linkage disequilibrium-based genetic
maps of the human genome which comprise the map-related biallelic markers of the invention
and will allow the identification of genes responsible for detectable traits using genome-wide
association studies and linkage disequilibrium mapping.






Summary of the Invention 
The present invention is based on the discovery of a set of novel map-related biallelic
markers. See Table 1. The position of these markers and knowledge of the surrounding
sequence has been used to design polynucleotide compositions which are useful in high
density mapping of the human genome as well as in determining the identity of nucleotides at
the marker position, and more complex association and haplotyping studies which are useful
in determining the genetic basis for disease states. In addition, the compositions and methods
of the invention find use in the identification of the targets for the development of
pharmaceutical agents and diagnostic methods, as well as the characterization of the
differential efficacious responses to and side effects from pharmaceutical agents acting on a
disease as well as other treatments.

A first embodiment of the present invention is a map of the human genome
comprising an ordered array of biallelic markers, wherein at least 1, 2, 5, 10, 20, 25, 30, 50,
100, 200, 500, 1000, 2000 or 3000 of said biallelic markers are map-related biallelic markers.
In addition, the maps of the present invention encompass maps with any further limitation
described in this disclosure, or those following, specified alone or in any combination:
optionally, said map-related biallelic marker may be selected individually or in any
combination from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to
2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally said ordered array
comprises at least 20,000, 40,000, 60,000, 80,000, 100,000, or 120,000 biallelic markers;
optionally, wherein said biallelic markers are separated from one another by an average
distance of 10 kb-200 kb, 15 kb-150 kb, 20 kb-100 kb, 100 kb-150 kb, 50 kb-100 kb, or
25 kb-50 kb in the human genome; optionally, said biallelic markers are distributed at an
average density of at least one biallelic marker every 150 kb, 50 kb, or 30 kb in the human
genome; or optionally, wherein all of said biallelic markers are selected to have a
heterozygosity rate of at least about 0.18, 0.32, or 0.42.
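The heterozygosity thresholds above can be related to allele frequencies with a minimal sketch. It assumes Hardy-Weinberg proportions, under which a biallelic marker whose less common allele has frequency p has expected heterozygosity 2p(1-p). That assumption and the function names are illustrative and are not taken from the disclosure.

```python
import math


def heterozygosity(p):
    """Expected heterozygosity 2p(1-p) of a biallelic marker (Hardy-Weinberg)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return 2.0 * p * (1.0 - p)


def minor_allele_frequency(h):
    """Smallest allele frequency p that satisfies 2p(1-p) = h."""
    if not 0.0 <= h <= 0.5:
        raise ValueError("heterozygosity of a biallelic marker lies in [0, 0.5]")
    return (1.0 - math.sqrt(1.0 - 2.0 * h)) / 2.0


# Minor allele frequencies implied by the thresholds 0.18, 0.32 and 0.42.
thresholds = {h: minor_allele_frequency(h) for h in (0.18, 0.32, 0.42)}
```

Under this assumption the three thresholds correspond to minor allele frequencies of about 0.1, 0.2 and 0.3, and the maximum of 0.5 is reached when both alleles are equally frequent.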

A second embodiment of the invention encompasses isolated, purified or recombinant
polynucleotides consisting of, consisting essentially of, or comprising a contiguous span of
nucleotides of a sequence selected as an individual or in any combination from the group
consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908, 3935 to 7842,
7866 to 11773, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125, 10126 to 11599,
and 11600 to 11773, or the complements thereof, wherein said contiguous span is at least 8,
10, 12, 15, 18, 19, 20, 21, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to
the extent that a contiguous span of these lengths is consistent with the lengths of the
particular Sequence ID. The present invention also relates to polynucleotides hybridizing
under stringent or intermediate conditions to a sequence selected from the group consisting of
SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908, 3935 to 7842, 7866 to 11773,
3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125, 10126 to 11599, and 11600 to
11773 and the complements thereof. In addition, the polynucleotides of the invention
encompass polynucleotides with any further limitation described in this disclosure, or those
following, specified alone or in any combination: said contiguous span may optionally
comprise a map-related biallelic marker; optionally either the 1st or the 2nd allele of the
respective SEQ ID No., as indicated in Table 1, may be specified as being present at said
map-related biallelic marker; optionally, said biallelic marker may be within 6, 5, 4, 3, 2, or 1
nucleotides of the center of said polynucleotide or at the center of said polynucleotide;
optionally, said polynucleotide may consist of, or consist essentially of a contiguous span
which ranges in length from 8, 10, 12, 15, 18 or 20 to 21, 25, 35, 40, 43, or 47 nucleotides;
optionally, said polynucleotide may consist of, or consist essentially of a contiguous span
which ranges in length from 8, 10, 12, 15, 18 or 20 to 21, 25, 35, 40, 43, or 47 nucleotides, or
be specified as being 12, 15, 18, 20, 25, 35, 40, 43, or 47 nucleotides in length and including
a map-related biallelic marker of said sequence, and optionally the 1st allele of Table 1 is
present at said biallelic marker; optionally, the 3' end of said contiguous span may be present
at the 3' end of said polynucleotide; optionally, said biallelic marker may be present at the 3'
end of said polynucleotide; optionally, the 3' end of said polynucleotide may be located within
or at least 2, 4, 6, 8, or 10 nucleotides upstream of a map-related biallelic marker in said
sequence, to the extent that such a distance is consistent with the lengths of the particular
Sequence ID; optionally, the 3' end of said polynucleotide may be located 1 nucleotide
upstream of a map-related biallelic marker in said sequence; and optionally, said
polynucleotide may further comprise a label.

A third embodiment of the invention encompasses any polynucleotide of the invention
attached to a solid support. In addition, the polynucleotides of the invention which are
attached to a solid support encompass polynucleotides with any further limitation described in
this disclosure, or those following, specified alone or in any combination: optionally, said
polynucleotides may be specified as attached individually or in groups of at least 2, 5, 8, 10,
12, 15, 20, 25, 50, 100, 200, or 500 distinct polynucleotides of the invention to a single solid
support; optionally, polynucleotides other than those of the invention may be attached to the
same solid support as polynucleotides of the invention; optionally, when multiple
polynucleotides are attached to a solid support they may be attached at random locations, or in
an ordered array; optionally, said ordered array may be addressable.

A fourth embodiment of the invention encompasses the use of any polynucleotide for,
or any polynucleotide for use in, determining the identity of nucleotides at a map-related
biallelic marker. In addition, the polynucleotides of the invention for use in determining the
identity of nucleotides at a map-related biallelic marker encompass polynucleotides with any
further limitation described in this disclosure, or those following, specified alone or in any
combination: optionally, said map-related biallelic marker may be selected individually or in
any combination from the group consisting of the biallelic markers of SEQ ID No. 1 to 3908, 1
to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said
polynucleotide may comprise a sequence disclosed in the present specification; optionally,
said polynucleotide may consist of, or consist essentially of any polynucleotide described in
the present specification; optionally, said determining may be performed in a hybridization
assay, sequencing assay, microsequencing assay, or an enzyme-based mismatch detection
assay; optionally, said polynucleotide may be attached to a solid support, array, or
addressable array; optionally, said polynucleotide may be labeled.

A fifth embodiment of the invention encompasses the use of any polynucleotide for,
or any polynucleotide for use in, amplifying a segment of nucleotides comprising a
map-related biallelic marker. In addition, the polynucleotides of the invention for use in
amplifying a segment of nucleotides comprising a map-related biallelic marker encompass
polynucleotides with any further limitation described in this disclosure, or those following,
specified alone or in any combination: optionally, said map-related biallelic marker may be
selected individually or in any combination from the group consisting of the biallelic markers
of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements
thereof; optionally, said polynucleotide may consist of, consist essentially of, or comprise a
sequence selected individually or in any combination from the group consisting of SEQ ID
Nos. 3935 to 7842, 7866 to 11773, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125,
10126 to 11599, and 11600 to 11773; optionally, said polynucleotide may consist of, or
consist essentially of any polynucleotide described in the present specification; optionally,
said amplifying may be performed by PCR or LCR; optionally, said polynucleotide may be
attached to a solid support, array, or addressable array; optionally, said polynucleotide may
be labeled.

A sixth embodiment of the invention encompasses methods of genotyping a biological
sample comprising determining the identity of a nucleotide at a map-related biallelic marker.
In addition, the genotyping methods of the invention encompass methods with any further
limitation described in this disclosure, or those following, specified alone or in any
combination: optionally, said map-related biallelic marker may be selected individually or in
any combination from the group consisting of the biallelic markers of SEQ ID No. 1 to 3908, 1
to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said method
further comprises determining the identity of a second nucleotide at said biallelic marker,
wherein said first nucleotide and second nucleotide are not base paired (by Watson & Crick
base pairing) to one another; optionally, said biological sample is derived from a single
individual or subject; optionally, said method is performed in vitro; optionally, said biallelic
marker is determined for both copies of said biallelic marker present in said individual's
genome; optionally, said biological sample is derived from multiple subjects or individuals;
optionally, said method further comprises amplifying a portion of said sequence comprising
the biallelic marker prior to said determining step; optionally, wherein said amplifying is
performed by PCR, LCR, or replication of a recombinant vector comprising an origin of
replication and said portion in a host cell; optionally, wherein said determining is performed
by a hybridization assay, sequencing assay, microsequencing assay, or an enzyme-based
mismatch detection assay.

A seventh embodiment of the invention comprises methods of estimating the
frequency of an allele in a population comprising genotyping individuals from said population
for a map-related biallelic marker and determining the proportional representation of said
biallelic marker in said population. In addition, the methods of estimating the frequency of an
allele in a population of the invention encompass methods with any further limitation
described in this disclosure, or those following, specified alone or in any combination:
optionally, said map-related biallelic marker may be selected individually or in any
combination from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to
2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, determining the
frequency of a biallelic marker allele in a population may be accomplished by determining the
identity of the nucleotides for both copies of said biallelic marker present in the genome of
each individual in said population and calculating the proportional representation of said
nucleotide at said map-related biallelic marker for the population; optionally, determining the
frequency of a biallelic marker allele in a population may be accomplished by performing a
genotyping method on a pooled biological sample derived from a representative number of
individuals, or each individual, in said population, and calculating the proportional amount of
said nucleotide compared with the total.
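The individual-genotyping route described above reduces to counting alleles over both chromosome copies of each individual. The following is a minimal sketch under the assumption that each genotype is recorded as a two-letter string such as "AG"; that encoding and the function name are hypothetical and are not taken from the disclosure.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Proportional representation of each allele at one biallelic marker.

    genotypes: iterable of two-character strings, one per individual,
    giving the nucleotide found on each of the two chromosome copies.
    """
    counts = Counter()
    for genotype in genotypes:
        if len(genotype) != 2:
            raise ValueError("each genotype must list exactly two alleles")
        counts.update(genotype)  # adds one count per chromosome copy
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Four individuals, eight chromosomes: five carry A and three carry G.
sample = ["AA", "AA", "AG", "GG"]
freqs = allele_frequencies(sample)
```

For a pooled sample the same proportion would instead be read from the relative signal of each nucleotide, which this sketch does not model.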

An eighth embodiment of the invention comprises methods of detecting an association
between an allele and a phenotype, comprising the steps of: a) determining the frequency of at
least one map-related biallelic marker allele in a trait positive population; b) determining the
frequency of said map-related biallelic marker allele in a control population; and c)
determining whether a statistically significant association exists between said genotype and
said phenotype. In addition, the methods of detecting an association between an allele and a
phenotype of the invention encompass methods with any further limitation described in this
disclosure, or those following, specified alone or in any combination: optionally, said
map-related biallelic marker may be selected individually or in any combination from the group
consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to
3908 and the complements thereof; optionally, said control population may be a trait-negative
population, or a random population; optionally, wherein said phenotype is selected from the
group consisting of disease, treatment response, treatment efficacy, drug response, drug
efficacy, and drug toxicity; optionally, the determining steps a) and b) are performed on all of
the biallelic markers of SEQ ID Nos. 1 to 3908.
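Step c) does not name a particular statistical test. One common choice, shown here purely as an illustration, is a Pearson chi-square test on the 2x2 table of allele counts in the trait positive and control populations. The function name and the example counts are hypothetical.

```python
import math


def allele_association(case_a1, case_a2, control_a1, control_a2):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele table.

    Arguments are chromosome counts carrying allele 1 or allele 2 in the
    trait positive (case) and control populations.
    Returns (chi-square statistic, p-value).
    """
    n = case_a1 + case_a2 + control_a1 + control_a2
    row_case = case_a1 + case_a2
    row_control = control_a1 + control_a2
    col_a1 = case_a1 + control_a1
    col_a2 = case_a2 + control_a2
    if 0 in (row_case, row_control, col_a1, col_a2):
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = (n * (case_a1 * control_a2 - case_a2 * control_a1) ** 2
            / (row_case * row_control * col_a1 * col_a2))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Allele 1 on 60 of 100 case chromosomes versus 40 of 100 control chromosomes.
chi2, p_value = allele_association(60, 40, 40, 60)
```

When many markers are screened in parallel, the significance threshold would normally be adjusted for multiple testing, which this sketch leaves out.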

A ninth embodiment of the present invention encompasses methods of estimating the
frequency of a haplotype for a set of biallelic markers in a population, comprising the steps of:
a) genotyping each individual in said population for at least one map-related biallelic marker;
b) genotyping each individual in said population for a second biallelic marker by determining
the identity of the nucleotides at said second biallelic marker for both copies of said second
biallelic marker present in the genome; and c) applying a haplotype determination method to
the identities of the nucleotides determined in steps a) and b) to obtain an estimate of said
frequency. In addition, the methods of estimating the frequency of a haplotype of the
invention encompass methods with any further limitation described in this disclosure, or those
following, specified alone or in any combination: optionally said haplotype determination
method is selected from the group consisting of asymmetric PCR amplification, double PCR
amplification of specific alleles, the Clark method, or an expectation maximization algorithm;
optionally, said map-related biallelic marker may be selected individually or in any
combination from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to
2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said second
biallelic marker is a map-related biallelic marker; optionally, the identity of the nucleotides at
the biallelic markers in every one of the sequences of SEQ ID No. 1 to 3908 is determined in
steps a) and b).
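Of the haplotype determination methods listed, the expectation maximization algorithm lends itself to a short sketch for the two-marker case. The version below assumes unphased genotypes coded as the number of copies (0, 1 or 2) of one chosen allele at each marker, and random mating. The coding, the fixed iteration count and the function name are illustrative choices and are not specified in the disclosure.

```python
# Allele carried on each of the two chromosomes for a genotype code 0, 1 or 2.
ALLELES = {0: (0, 0), 1: (0, 1), 2: (1, 1)}
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate two-marker haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2) pairs, each value being 0, 1 or 2.
    Only double heterozygotes (1, 1) have an ambiguous phase.
    """
    if not genotypes:
        raise ValueError("at least one genotype is required")
    freq = {h: 0.25 for h in HAPLOTYPES}
    n_chromosomes = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E-step: split the individual between the two possible phases.
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1.0 - w
                counts[(1, 0)] += 1.0 - w
            else:
                # Phase is unambiguous when at most one marker is heterozygous.
                for hap in zip(ALLELES[g1], ALLELES[g2]):
                    counts[hap] += 1.0
        # M-step: re-estimate frequencies from the expected counts.
        freq = {h: c / n_chromosomes for h, c in counts.items()}
    return freq


estimates = em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)])
```

A practical implementation would usually iterate until the estimates stop changing and would extend the same idea to more than two markers.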

A tenth embodiment of the present invention encompasses methods of detecting an
association between a haplotype and a phenotype, comprising the steps of: a) estimating the
frequency of at least one haplotype in a trait positive population according to a method of
estimating the frequency of a haplotype of the invention; b) estimating the frequency of said
haplotype in a control population according to the method of estimating the frequency of a
haplotype of the invention; and c) determining whether a statistically significant association
exists between said haplotype and said phenotype. In addition, the methods of detecting an
association between a haplotype and a phenotype of the invention encompass methods with
any further limitation described in this disclosure, or those following, specified alone or in any
combination: optionally, said map-related biallelic marker may be in a sequence selected
individually or in any combination from the group consisting of SEQ ID No. 1 to 3908, 1 to
2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, said control
population may be a trait-negative population, or a random population; optionally, wherein
said phenotype is selected from the group consisting of disease, treatment response, treatment
efficacy, drug response, drug efficacy, and drug toxicity; optionally, the identity of the
nucleotides at the biallelic markers in every one of the following sequences: SEQ ID No. 1 to
3908 is included in the estimating steps a) and b).

An eleventh embodiment of the present invention is a method of identifying a gene
associated with a detectable trait comprising the steps of: a) determining the frequency of each
allele of at least one map-related biallelic marker in individuals having the detectable trait and
individuals lacking the detectable trait; b) identifying at least one allele of one or more biallelic
markers having a statistically significant association with the detectable trait; and c)
identifying a gene in linkage disequilibrium with said allele. In addition, the methods of the
present invention for identifying a gene associated with a detectable trait encompass methods
with any further limitation described in this disclosure, or those following, specified alone or
in any combination: optionally, wherein the method further comprises d) identifying a
mutation in the gene identified in step c) which is associated with the detectable trait;
optionally, wherein the individuals having the detectable trait and the individuals lacking the
detectable trait are readily distinguishable from one another; optionally, wherein the
individuals having the detectable trait and the individuals lacking the detectable trait are
selected from a bimodal population; optionally, wherein the individuals having the detectable
trait are at one extreme of the population and the individuals lacking the detectable trait are at
the other extreme of the population; optionally, said map-related biallelic marker may be in a
sequence selected individually or in any combination from the group consisting of SEQ ID
No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof;
optionally, wherein said detectable trait is selected from the group consisting of disease,
treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.

A twelfth embodiment of the present invention is a method of identifying biallelic
markers associated with a detectable trait comprising the steps of: a) determining the
frequencies of a set of biallelic markers comprising at least one map-related biallelic marker in
individuals who express said detectable trait and individuals who do not express said
detectable trait; and b) identifying one or more biallelic markers in said set which are
statistically associated with the expression of said detectable trait. In addition, the methods of
the present invention for identifying biallelic markers associated with a detectable trait
encompass methods with any further limitation described in this disclosure, or those
following, specified alone or in any combination: optionally, said map-related biallelic marker
may be in a sequence selected individually or in any combination from the group consisting of
SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3734 to 3908 and the complements thereof;
optionally, wherein said detectable trait is selected from the group consisting of disease,
treatment response, treatment efficacy, drug response, drug efficacy, and drug toxicity.

A thirteenth embodiment of the present invention is a method of identifying biallelic
marker(s) in linkage disequilibrium with a trait causing allele or in linkage disequilibrium with
a trait-associated biallelic marker comprising the steps of: a) selecting at least one map-related
biallelic marker which is in the genomic region suspected of containing the trait-causing allele
or the trait-associated biallelic marker; and b) determining which of the map-related biallelic
markers are associated with the trait-causing allele or in linkage disequilibrium with the
trait-associated biallelic marker. In addition, the methods of the present invention for identifying
biallelic marker(s) in linkage disequilibrium with a trait causing allele or in linkage
disequilibrium with a trait-associated biallelic marker encompass methods with any further
limitation described in this disclosure, or those following, specified alone or in any
combination: optionally, said map-related biallelic marker may be in a sequence selected
individually or in any combination from the group consisting of SEQ ID No. 1 to 3908, 1 to
2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said
detectable trait is selected from the group consisting of disease, treatment response, treatment
efficacy, drug response, drug efficacy, and drug toxicity.
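The passage above does not say how linkage disequilibrium between two markers is quantified. As an illustration only, the sketch below computes three widely used pairwise measures (D, the normalized |D'| and r^2) from one haplotype frequency and the two allele frequencies. The function and parameter names are hypothetical.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Pairwise linkage disequilibrium between two biallelic markers.

    p_ab: frequency of the haplotype carrying allele a at the first marker
          and allele b at the second marker.
    p_a, p_b: frequencies of allele a and allele b.
    Returns (D, |D'|, r^2).
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Complete disequilibrium: allele a is always found together with allele b.
complete = linkage_disequilibrium(0.5, 0.5, 0.5)
# Equilibrium: the haplotype frequency equals the product of allele frequencies.
independent = linkage_disequilibrium(0.25, 0.5, 0.5)
```

The haplotype frequency passed in would typically come from a haplotype estimation step rather than from direct observation.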

A fourteenth embodiment of the present invention is a method for determining
whether an individual is at risk of developing a detectable trait or suffers from a detectable
trait comprising the steps of: a) obtaining a nucleic acid sample from the individual; b)
screening the nucleic acid sample with at least one map-related biallelic marker; and c)
determining whether the nucleic acid sample contains at least one allele of said map-related
biallelic marker statistically associated with the detectable trait. In addition, the methods of
the present invention for determining whether an individual is at risk of developing a
detectable trait or suffers from a detectable trait encompass methods with any further
limitation described in this disclosure, or those following, specified alone or in any
combination: optionally, said map-related biallelic marker may be in a sequence selected
individually or in any combination from the group consisting of SEQ ID No. 1 to 3908, 1 to
2260, 2261 to 3734, 3734 to 3908 and the complements thereof; optionally, wherein said
detectable trait is selected from the group consisting of disease, treatment response, treatment
efficacy, drug response, drug efficacy, and drug toxicity.

A fifteenth embodiment of the present invention is a method of administering a drug 
or a treatment comprising the steps of: a) obtaining a nucleic acid sample from an individual; 
b) determining the identity of the polymorphic base of at least one map-related biallelic 
marker which is associated with a positive response to the treatment or the drug, or at least 
one map-related biallelic marker which is associated with a negative response to the treatment 
or the drug; and c) administering the treatment or the drug to the individual if the nucleic acid 
sample contains said biallelic marker associated with a positive response to the treatment or 
the drug or if the nucleic acid sample lacks said biallelic marker associated with a negative 
response to the treatment or the drug. In addition, the methods of the present invention for 
administering a drug or a treatment encompass methods with any further limitation described 
in this disclosure, or those following, specified alone or in any combination: optionally, said 
map-related biallelic marker may be in a sequence selected individually or in any combination 
from the group consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3735 to 3908 
and the complements thereof; or optionally, the administering step comprises administering 
the drug or the treatment to the individual if the nucleic acid sample contains said biallelic 
marker associated with a positive response to the treatment or the drug and the nucleic acid 
sample lacks said biallelic marker associated with a negative response to the treatment or the 
drug. 
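
The two alternative administering conditions recited in this embodiment can be summarized as a simple decision rule. The following Python sketch is illustrative only: the function name `should_administer`, its boolean inputs, and the `require_both` flag are hypothetical, and the sketch assumes that genotyping has already established whether the sample contains each marker.

```python
def should_administer(has_positive_marker: bool,
                      has_negative_marker: bool,
                      require_both: bool = False) -> bool:
    """Decide whether to administer the drug or treatment.

    has_positive_marker: sample contains a marker associated with a
        positive response (hypothetical input from a genotyping step).
    has_negative_marker: sample contains a marker associated with a
        negative response.
    require_both: if True, apply the optional stricter rule (positive
        marker present AND negative marker absent); otherwise apply the
        base rule (positive marker present OR negative marker absent).
    """
    if require_both:
        return has_positive_marker and not has_negative_marker
    return has_positive_marker or not has_negative_marker
```

Under the base rule, a sample carrying both markers still qualifies; under the optional stricter rule it does not.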

A sixteenth embodiment of the present invention is a method of selecting an 
individual for inclusion in a clinical trial of a treatment or drug comprising the steps of: a) 
obtaining a nucleic acid sample from an individual; b) determining the identity of the 
polymorphic base of at least one map-related biallelic marker which is associated with a 
positive response to the treatment or the drug, or at least one map-related biallelic marker 
which is associated with a negative response to the treatment or the drug in the nucleic acid 
sample; and c) including the individual in the clinical trial if the nucleic acid sample contains 
said map-related biallelic marker associated with a positive response to the treatment or the 
drug or if the nucleic acid sample lacks said biallelic marker associated with a negative 
response to the treatment or the drug. In addition, the methods of the present invention for 
selecting an individual for inclusion in a clinical trial of a treatment or drug encompass 
methods with any further limitation described in this disclosure, or those following, specified 
alone or in any combination: optionally, said map-related biallelic marker may be in a 
sequence selected individually or in any combination from the group consisting of SEQ ID 
No. 1 to 3908, 1 to 2260, 2261 to 3734, 3735 to 3908 and the complements thereof; 
optionally, the including step comprises administering the drug or the treatment to the 
individual if the nucleic acid sample contains said biallelic marker associated with a positive 
response to the treatment or the drug and the nucleic acid sample lacks said biallelic marker 
associated with a negative response to the treatment or the drug. 

A seventeenth embodiment of the present invention is a method of identifying a gene 
associated with a detectable trait comprising the steps of: a) selecting a gene suspected of 
being associated with a detectable trait; and b) identifying at least one map-related biallelic 
marker within said gene which is associated with said detectable trait. In addition, the 
methods of the present invention for identifying a gene associated with a detectable trait 
encompass methods with any further limitation described in this disclosure, or those 
following, specified alone or in any combination: optionally, said map-related biallelic marker 
may be in a sequence selected individually or in any combination from the group consisting of 
SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3734, 3735 to 3908 and the complements thereof; 
optionally, the identifying step comprises determining the frequencies of the map-related 
biallelic marker(s) in individuals who express said detectable trait and individuals who do not 
express said detectable trait and identifying one or more biallelic markers which are 
statistically associated with the expression of the detectable trait. 
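
The optional identifying step above compares marker allele frequencies between individuals who express the trait and individuals who do not. This passage does not prescribe a particular statistical test, so the following Python sketch assumes a Pearson chi-square test on a 2x2 table of allele counts (1 degree of freedom, no continuity correction). The function name `allele_association` and the example counts are hypothetical.

```python
import math


def allele_association(trait_pos_counts, trait_neg_counts):
    """Pearson chi-square test on a 2x2 table of allele counts.

    Each argument is a pair (count of 1st allele, count of 2nd allele)
    observed in the trait-positive or trait-negative sample.
    Returns (chi-square statistic, p-value) for 1 degree of freedom.
    """
    a, b = trait_pos_counts
    c, d = trait_neg_counts
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("each row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: 100 chromosomes sampled in each group.
chi2, p = allele_association((60, 40), (40, 60))
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

A small p-value suggests that the allele frequencies differ between the two groups; whether a given threshold is appropriate depends on the number of markers tested.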

Additional embodiments are set forth in the Detailed Description of the Invention and 
in the Examples. 

Brief Description of the Drawings 
Figure 1 is a cytogenetic map of chromosome 21. 

Figure 2a shows the results of a computer simulation of the distribution of 
inter-marker spacing on a randomly distributed set of biallelic markers indicating the percentage of 
biallelic markers which will be spaced a given distance apart for 1, 2, or 3 markers/BAC in a 
genomic map (assuming a set of 20,000 minimally overlapping BACs covering the genome 
are evaluated). 

Figure 2b shows the results of a computer simulation of the distribution of 
inter-marker spacing on a randomly distributed set of biallelic markers indicating the percentage of 
biallelic markers which will be spaced a given distance apart for 1, 3, or 6 markers/BAC in a 
genomic map (assuming a set of 20,000 minimally overlapping BACs covering the genome 
are evaluated). 

Figure 3 shows, for a series of hypothetical sample sizes, the p-value significance 
obtained in association studies performed using individual markers from the high-density 
biallelic map, according to various hypotheses regarding the difference of allelic frequencies 
between the trait-positive and trait-negative samples. 

Figure 4 is a hypothetical association analysis conducted with a map comprising about 
3,000 biallelic markers. 

Figure 5 is a hypothetical association analysis conducted with a map comprising about 
20,000 biallelic markers. 

Figure 6 is a hypothetical association analysis conducted with a map comprising about 
60,000 biallelic markers. 

Figure 7 is a haplotype analysis using biallelic markers in the Apo E region. 

Figure 8 is a simulated haplotype analysis using the biallelic markers in the Apo E 
region included in the haplotype analysis of Figure 7. 

Figure 9 shows a minimal array of overlapping clones which was chosen for further 
studies of biallelic markers associated with prostate cancer, the positions of STS markers 
known to map in the candidate genomic region along the contig, and the locations of biallelic 
markers along the BAC contig spanning a genomic region harboring a candidate gene 
associated with prostate cancer which were identified using the methods of the present 
invention. 

Figure 10 is a rough localization of a candidate gene for prostate cancer which was 
obtained by determining the frequencies of the biallelic markers of Figure 9 in affected and 
unaffected populations. 

Figure 11 is a further refinement of the localization of the candidate gene for prostate 
cancer using additional biallelic markers which were not included in the rough localization 
illustrated in Figure 10. 

Figure 12 is a haplotype analysis using the biallelic markers in the genomic region of 
the gene associated with prostate cancer. 

Figure 13 is a simulated haplotype using the six markers included in haplotype 5 of 
Figure 12. 

Figure 14 is a block diagram of an exemplary computer system. 

Figure 15 is a flow diagram illustrating one embodiment of a process 200 for comparing 
a new nucleotide or protein sequence with a database of sequences in order to determine the 
homology levels between the new sequence and the sequences in the database. 

Figure 16 is a flow diagram illustrating one embodiment of a process 250 in a 
computer for determining whether two sequences are homologous. 

Figure 17 is a flow diagram illustrating one embodiment of an identifier process 300 
for detecting the presence of a feature in a sequence. 

Brief Description Of The Sequence Listing 
SEQ ID Nos. 1 to 3908 contain nucleotide sequences comprising a portion of the 
map-related biallelic markers of the invention. 

SEQ ID Nos. 3909 to 3934 contain nucleotide sequences comprising a portion of the 
map-related biallelic markers which are shown to be associated with Alzheimer's disease, 
prostate cancer or asthma as described in the Examples. 

SEQ ID Nos. 3935 to 7842 contain nucleotide sequences of upstream amplification 
primers (PU) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 
1 to 3908. 

SEQ ID Nos. 7843 to 7865 contain nucleotide sequences of upstream amplification 
primers (PU) designed to amplify sequences containing the biallelic markers of SEQ ID Nos. 
3909 to 3934. 

SEQ ID Nos. 7866 to 11773 contain nucleotide sequences of downstream 
amplification primers (RP) designed to amplify sequences containing the biallelic markers of 
SEQ ID Nos. 1 to 3908. 

SEQ ID Nos. 11774 to 11796 contain nucleotide sequences of downstream 
amplification primers (RP) designed to amplify sequences containing the biallelic markers of 
SEQ ID Nos. 3909 to 3934. 

Detailed Description of the Embodiments 
Before describing the invention in greater detail, the following definitions are set forth 
to illustrate and define the meaning and scope of the terms used to describe the invention 
herein. 

Definitions 

As used interchangeably herein, the terms "nucleic acids", "oligonucleotides", and 
"polynucleotides" include RNA, DNA, or RNA/DNA hybrid sequences of more than one 
nucleotide in either single chain or duplex form. The term "nucleotide" is used herein as an 
adjective to describe molecules comprising RNA, DNA, or RNA/DNA hybrid sequences of 
any length in single-stranded or duplex form. The term "nucleotide" is also used herein as a 
noun to refer to individual nucleotides or varieties of nucleotides, meaning a molecule, or 
individual unit in a larger nucleic acid molecule, comprising a purine or pyrimidine, a ribose 
or deoxyribose sugar moiety, and a phosphate group, or phosphodiester linkage in the case of 
nucleotides within an oligonucleotide or polynucleotide. The term "nucleotide" is 
also used herein to encompass "modified nucleotides" which comprise at least one of the 
following modifications: (a) an alternative linking group, (b) an analogous form of purine, (c) an 
analogous form of pyrimidine, or (d) an analogous sugar. For examples of analogous linking 
groups, purines, pyrimidines, and sugars see, for example, PCT publication No. WO 95/04064. 
However, the polynucleotides of the invention are preferably comprised of greater than 50% 
conventional deoxyribose nucleotides, and most preferably greater than 90% conventional 
deoxyribose nucleotides. The polynucleotide sequences of the invention may be prepared by 
any known method, including synthetic, recombinant, ex vivo generation, or a combination 
thereof, as well as utilizing any purification methods known in the art. 

The term "purified" is used herein to describe a polynucleotide or polynucleotide 
vector of the invention which has been separated from other compounds including, but not 
limited to, other nucleic acids, carbohydrates, lipids and proteins (such as the enzymes used in 
the synthesis of the polynucleotide), or the separation of covalently closed polynucleotides 
from linear polynucleotides. A polynucleotide is substantially pure when at least about 50%, 
preferably 60 to 75%, of a sample exhibits a single polynucleotide sequence and conformation 
(linear versus covalently closed). A substantially pure polynucleotide typically comprises 
about 50%, preferably 60 to 90%, weight/weight of a nucleic acid sample, more usually about 
95%, and preferably is over about 99% pure. Polynucleotide purity or homogeneity may be 
indicated by a number of means well known in the art, such as agarose or polyacrylamide gel 
electrophoresis of a sample, followed by visualizing a single polynucleotide band upon 
staining the gel. For certain purposes higher resolution can be provided by using HPLC or 
other means well known in the art. 

The term "primer" denotes a specific oligonucleotide sequence which is 
complementary to a target nucleotide sequence and used to hybridize to the target nucleotide 
sequence. A primer serves as an initiation point for nucleotide polymerization catalyzed by 
either DNA polymerase, RNA polymerase or reverse transcriptase. 

The term "probe" denotes a defined nucleic acid segment (or nucleotide analog 
segment, e.g., polynucleotide as defined herein) which can be used to identify a specific 
polynucleotide sequence present in samples, said nucleic acid segment comprising a 
nucleotide sequence complementary to the specific polynucleotide sequence to be identified. 

The terms "detectable trait", "trait" and "phenotype" are used interchangeably herein 
and refer to any visible, detectable or otherwise measurable property of an organism such as 
symptoms of, or susceptibility to, a disease, for example. Typically the terms "detectable trait", 
"trait" or "phenotype" are used herein to refer to symptoms of, or susceptibility to, a disease; 
or to refer to an individual's response to an agent, drug, or treatment acting on a disease; or to 
refer to symptoms of, or susceptibility to, side effects of an agent acting on a disease. 

The term "treatment" is used herein to encompass any medical intervention known in 
the art including, for example, the administration of pharmaceutical agents, medically 
prescribed changes in diet, or habits such as a reduction in smoking or drinking, surgery, the 
application of medical devices, and the application or reduction of certain physical conditions, 
for example, light or radiation. 

The term "allele" is used herein to refer to variants of a nucleotide sequence. A 
biallelic polymorphism has two forms, designated herein as the 1st allele and the 2nd allele. 
Diploid organisms may be homozygous or heterozygous for an allelic form. 

The term "heterozygosity rate" is used herein to refer to the incidence of individuals in 
a population which are heterozygous at a particular allele. In a biallelic system the 
heterozygosity rate is on average equal to $2P_a(1-P_a)$, where $P_a$ is the frequency of the least 
common allele. In order to be useful in genetic studies a genetic marker should have an 
adequate level of heterozygosity to allow a reasonable probability that a randomly selected 
person will be heterozygous. 
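
As a worked illustration of the expression $2P_a(1-P_a)$, the following Python sketch computes the expected heterozygosity rate from the frequency of the least common allele. The function name `heterozygosity_rate` is hypothetical. The two example frequencies, 20% and 30%, are the thresholds cited in the definition of "biallelic marker" in this disclosure.

```python
def heterozygosity_rate(p_a: float) -> float:
    """Expected heterozygosity rate 2*P_a*(1 - P_a) for a biallelic marker.

    p_a is the frequency of the least common allele, so it must lie
    between 0 and 0.5 inclusive.
    """
    if not 0.0 <= p_a <= 0.5:
        raise ValueError("frequency of the least common allele must be in [0, 0.5]")
    return 2.0 * p_a * (1.0 - p_a)


for freq in (0.2, 0.3):
    print(f"P_a = {freq:.1f} -> heterozygosity rate = {heterozygosity_rate(freq):.2f}")
```

The rate reaches its maximum of 0.5 when both alleles are equally frequent.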

The term "genotype" as used herein refers to the identity of the alleles present in an 
individual or a sample. In the context of the present invention a genotype preferably refers to 
the description of the biallelic marker alleles present in an individual or a sample. The term 
"genotyping" a sample or an individual for a biallelic marker consists of determining the 
specific allele or the specific nucleotide carried by an individual at a biallelic marker. 

The term "mutation" as used herein refers to a difference in DNA sequence between 
or among different genomes or individuals which has a frequency below 1%. 

The term "haplotype" refers to a combination of alleles present in an individual or a 
sample. In the context of the present invention a haplotype preferably refers to a combination 
of biallelic marker alleles found in a given individual and which may be associated with a 
phenotype. 

The term "polymorphism" as used herein refers to the occurrence of two or more 
alternative genomic sequences or alleles between or among different genomes or individuals. 
"Polymorphic" refers to the condition in which two or more variants of a specific genomic 
sequence can be found in a population. A "polymorphic site" is the locus at which the 
variation occurs. A single nucleotide polymorphism is a single base pair change. Typically a 
single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at 
the polymorphic site. Deletion of a single nucleotide or insertion of a single nucleotide also 
gives rise to single nucleotide polymorphisms. In the context of the present invention "single 
nucleotide polymorphism" preferably refers to a single nucleotide substitution. Typically, 
between different genomes or between different individuals, the polymorphic site may be 
occupied by two different nucleotides. 

The terms "biallelic polymorphism" and "biallelic marker" are used interchangeably 
herein to refer to a polymorphism having two alleles at a fairly high frequency in the 
population, preferably a single nucleotide polymorphism. A "biallelic marker allele" refers to 
the nucleotide variants present at a biallelic marker site. Typically the frequency of the less 
common allele of the biallelic markers of the present invention has been validated to be 
greater than 1%, preferably the frequency is greater than 10%, more preferably the frequency 
is at least 20% (i.e. heterozygosity rate of at least 0.32), even more preferably the frequency is 
at least 30% (i.e. heterozygosity rate of at least 0.42). A biallelic marker wherein the 
frequency of the less common allele is 30% or more is termed a "high quality biallelic 
marker." 

The location of nucleotides in a polynucleotide with respect to the center of the 
polynucleotide is described herein in the following manner. When a polynucleotide has an 
odd number of nucleotides, the nucleotide at an equal distance from the 3' and 5' ends of the 
polynucleotide is considered to be "at the center" of the polynucleotide, and any nucleotide 
immediately adjacent to the nucleotide at the center, or the nucleotide at the center itself, is 
considered to be "within 1 nucleotide of the center." With an odd number of nucleotides in a 
polynucleotide, any of the five nucleotide positions in the middle of the polynucleotide would 
be considered to be within 2 nucleotides of the center, and so on. When a polynucleotide has 
an even number of nucleotides, there would be a bond and not a nucleotide at the center of the 
polynucleotide. Thus, either of the two central nucleotides would be considered to be "within 
1 nucleotide of the center" and any of the four nucleotides in the middle of the polynucleotide 
would be considered to be "within 2 nucleotides of the center", and so on. For polymorphisms 
which involve the substitution, insertion or deletion of 1 or more nucleotides, the 
polymorphism, allele or biallelic marker is "at the center" of a polynucleotide if the difference 
between the distance from the substituted, inserted, or deleted polynucleotides of the 
polymorphism and the 3' end of the polynucleotide, and the distance from the substituted, 
inserted, or deleted polynucleotides of the polymorphism and the 5' end of the polynucleotide 
is zero or one nucleotide. If this difference is 0 to 3, then the polymorphism is considered to 
be "within 1 nucleotide of the center." If the difference is 0 to 5, the polymorphism is 
considered to be "within 2 nucleotides of the center." If the difference is 0 to 7, the 
polymorphism is considered to be "within 3 nucleotides of the center," and so on. 
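
The rule for polymorphisms stated above can be expressed compactly: a polymorphism is "within n nucleotides of the center" when the difference between its distances to the two ends is at most 2n + 1, with n = 0 corresponding to "at the center". The following Python sketch is one reading of that rule. The function name `center_offset`, the 1-based coordinates, and the choice to measure each distance as the number of nucleotides lying between the polymorphism and the respective end are assumptions made for illustration, as is the 47-nucleotide length used in the example.

```python
def center_offset(length: int, start: int, end: int) -> int:
    """Smallest n such that a polymorphism is within n nucleotides of the center.

    length: number of nucleotides in the polynucleotide.
    start, end: 1-based inclusive positions of the polymorphism
        (start == end for a single-nucleotide substitution).
    Returns 0 when the polymorphism is "at the center".
    """
    if not 1 <= start <= end <= length:
        raise ValueError("polymorphism must lie inside the polynucleotide")
    dist_5_prime = start - 1       # nucleotides between the 5' end and the polymorphism
    dist_3_prime = length - end    # nucleotides between the polymorphism and the 3' end
    difference = abs(dist_5_prime - dist_3_prime)
    # difference 0-1 -> at the center; 2-3 -> within 1; 4-5 -> within 2; 6-7 -> within 3; ...
    return difference // 2


# Assumed 47-nucleotide sequence with the polymorphic base at position 24.
print(center_offset(47, 24, 24))  # at the center
print(center_offset(47, 23, 23))  # within 1 nucleotide of the center
```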

The term "upstream" is used herein to refer to a location which is toward the 5' end of 
the polynucleotide from a specific reference point. 

The terms "base paired" and "Watson & Crick base paired" are used interchangeably 
herein to refer to nucleotides which can be hydrogen bonded to one another by virtue of their 
sequence identities in a manner like that found in double-helical DNA with thymine or uracil 
residues linked to adenine residues by two hydrogen bonds and cytosine and guanine residues 
linked by three hydrogen bonds (See Stryer, L., Biochemistry, 4th edition, 1995). 

The terms "complementary" or "complement thereof" are used herein to refer to the 
sequences of polynucleotides which are capable of forming Watson & Crick base pairing with 
another specified polynucleotide throughout the entirety of the complementary region. This 
term is applied to pairs of polynucleotides based solely upon their sequences and not any 
particular set of conditions under which the two polynucleotides would actually bind. 
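
Because complementarity as defined above depends only on sequence, it can be checked by direct string comparison. The following Python sketch assumes DNA sequences written 5' to 3' using the letters A, C, G and T. The function names `reverse_complement` and `is_complementary` are hypothetical, and RNA (uracil) and modified nucleotides are not handled.

```python
# Watson & Crick pairing for DNA: A pairs with T, C pairs with G.
_PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complement of a DNA sequence, written 5' to 3'."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence may contain only A, C, G and T")
    return seq.translate(_PAIRING)[::-1]


def is_complementary(seq_a: str, seq_b: str) -> bool:
    """True if seq_b base pairs with seq_a throughout its entire length."""
    return reverse_complement(seq_a) == seq_b.upper()


print(reverse_complement("ATGC"))        # GCAT
print(is_complementary("ATGC", "GCAT"))  # True
```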

As used herein the term "map-related biallelic marker" relates to a biallelic marker in 
linkage disequilibrium with any of the sequences disclosed in SEQ ID Nos. 1 to 3908 which 
contain a biallelic marker of the map. The term map-related biallelic marker encompasses all 
of the biallelic markers disclosed in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, and 3735 to 
3908. The preferred map-related biallelic marker alleles of the present invention include each 
one of the alleles selected individually or in any combination from the biallelic markers of 
SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, and 3735 to 3908, as identified in field 
<223> of the allele feature in the appended Sequence Listing, individually or in groups 
consisting of all the possible combinations of the alleles. 

The terms "1st allele" and "2nd allele" refer to the nucleotide located at the 
polymorphic base of a polynucleotide sequence containing a biallelic marker, as identified in 
field <222> of the allele feature in the appended Sequence Listing for each Sequence ID 
number. As used herein, the polymorphic base is located at nucleotide position 24 for each of 
SEQ ID Nos. 1 to 3908, with the exception of SEQ ID Nos. 914, 1013, 2544, 3434, 3795, and 
3028. The polymorphic base is located at nucleotide position 23 for SEQ ID Nos. 914, 1013 
and 2544, at nucleotide position 21 for SEQ ID No. 3028, at nucleotide position 20 for SEQ 
ID No. 3434. 
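
The positions listed above can be captured in a small lookup. The following Python sketch is illustrative only: the function name `polymorphic_base_position` is hypothetical, and because this passage lists SEQ ID No. 3795 as an exception without stating its position, the sketch returns `None` for that sequence rather than guessing.

```python
from typing import Optional

# Positions stated in the text for the listed exceptions (1-based).
_EXCEPTION_POSITIONS = {914: 23, 1013: 23, 2544: 23, 3028: 21, 3434: 20}

# Listed as an exception, but no position is given in this passage.
_POSITION_NOT_STATED = {3795}


def polymorphic_base_position(seq_id: int) -> Optional[int]:
    """Nucleotide position of the polymorphic base for SEQ ID Nos. 1 to 3908."""
    if not 1 <= seq_id <= 3908:
        raise ValueError("only SEQ ID Nos. 1 to 3908 are covered")
    if seq_id in _POSITION_NOT_STATED:
        return None
    return _EXCEPTION_POSITIONS.get(seq_id, 24)
```

In practice the position would be read from the allele feature of the Sequence Listing itself; the lookup merely restates the text.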

I. Biallelic Markers And Polynucleotides Comprising Biallelic Markers 
Polynucleotides of the present invention 

The present invention encompasses polynucleotides for use as primers and probes in 
the methods of the invention. All of the polynucleotides of the invention may be specified as 
being isolated, purified or recombinant. These polynucleotides may consist of, consist 
essentially of, or comprise a contiguous span of nucleotides of a sequence from any sequence 
in the Sequence Listing as well as sequences which are complementary thereto ("complements 
thereof"). The "contiguous span" may be at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 
35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous span of these 
lengths is consistent with the lengths of the particular Sequence ID. It should be noted that the 
polynucleotides of the present invention are not limited to having the exact flanking sequences 
surrounding the polymorphic bases which are enumerated in the Sequence Listing. Rather, it 
will be appreciated that the flanking sequences surrounding the biallelic markers, or any of the 
primers or probes of the invention which are more distant from the markers, may be 
lengthened or shortened to any extent compatible with their intended use and the present 
invention specifically contemplates such sequences. It will be appreciated that the 
polynucleotides referred to in the Sequence Listing may be of any length compatible with their 
intended use. Also the flanking regions outside of the contiguous span need not be 
homologous to native flanking sequences which actually occur in human subjects. The 
addition of any nucleotide sequence which is compatible with the nucleotide's intended use is 
specifically contemplated. The contiguous span may optionally include the map-related 
biallelic marker in said sequence. Biallelic markers generally consist of a polymorphism at 
one single base position. Each biallelic marker therefore corresponds to two forms of a 
polynucleotide sequence which, when compared with one another, present a nucleotide 
modification at one position. Usually, the nucleotide modification involves the substitution of 
one nucleotide for another. Optionally either the 1st allele or the 2nd allele of the biallelic 
markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3734, and 3735 to 3908 may be 
specified as being present at the map-related biallelic marker. 

Preferred polynucleotides may consist of, consist essentially of, or comprise a 
contiguous span of nucleotides of a sequence from SEQ ID Nos. 1 to 2260 as well as 
sequences which are complementary thereto. The "contiguous span" may be at least 8, 10, 12, 
15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent 
that a contiguous span of these lengths is consistent with the lengths of the particular 
Sequence ID. Particularly preferred are polynucleotides which consist of, consist essentially 
of, or comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos. 1 to 
2260, or the complements thereof, wherein the 1st allele of the biallelic marker of the SEQ ID 
No. is present at the map-related biallelic marker. Other preferred polynucleotides consist of, 
consist essentially of, or comprise a contiguous span of nucleotides of any of SEQ ID Nos. 1 
to 2260, or the complements thereof, wherein the 2nd allele of the biallelic marker of the SEQ 
ID No. is present at the map-related biallelic marker. Preferred polynucleotides may consist 
of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 
22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous 
span of these lengths is consistent with the lengths of the particular Sequence ID No., of a 
sequence from SEQ ID Nos. 2261 to 3734 as well as sequences which are complementary 
thereto. Particularly preferred are polynucleotides which consist of, consist essentially of, or 
comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos. 2261 to 3734, 
or the complements thereof, wherein the 1st allele of the biallelic marker of the SEQ ID No. is 
present at the map-related biallelic marker. Other preferred polynucleotides consist of, consist 
essentially of, or comprise a contiguous span of nucleotides of any of SEQ ID Nos. 2261 to 
3734, or the complements thereof, wherein the 2nd allele of the biallelic marker of the SEQ 
ID No. is present at the map-related biallelic marker. Preferred polynucleotides may consist 
of, consist essentially of, or comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 
22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a contiguous 
span of these lengths is consistent with the lengths of the particular Sequence ID No., of a 
sequence from SEQ ID Nos. 3735 to 3908 as well as sequences which are complementary 
thereto. Particularly preferred are polynucleotides which consist of, consist essentially of, or 
comprise a contiguous span of nucleotides of a sequence of any of SEQ ID Nos. 3735 to 3908, 
or the complements thereof, wherein the 1st allele of the biallelic marker of the SEQ ID No. is 
present at the map-related biallelic marker. Other preferred polynucleotides consist of, consist 
essentially of, or comprise a contiguous span of nucleotides of any of SEQ ID Nos. 3735 to 
3908, or the complements thereof, wherein the 2nd allele of the biallelic marker of the SEQ ID 
No. is present at the map-related biallelic marker. Also encompassed by the polynucleotides 
of the present invention are polynucleotides which consist of, consist essentially of, or 
comprise a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 
46 or 47 nucleotides in length, to the extent that a contiguous span of these lengths is 
consistent with the lengths of the particular Sequence ID, of a sequence from SEQ ID Nos. 
1201, 3242, 3907 and 3908 as well as sequences which are complementary thereto, wherein 
said contiguous span of SEQ ID Nos. 1201 or 3242 contains a "G" at the polymorphic base, or 
wherein said contiguous span of SEQ ID Nos. 3907 or 3908 contains an "A" at the 
polymorphic base. 
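
To make the notion of a contiguous span that includes the biallelic marker concrete, the following Python sketch enumerates every window of a given length that covers the polymorphic base. The function name `spans_containing_marker`, the 1-based marker position, and the short example sequence are hypothetical; the sequences of the Sequence Listing are not reproduced here.

```python
def spans_containing_marker(sequence: str, marker_pos: int, span_len: int) -> list:
    """All contiguous spans of span_len nucleotides that include the marker.

    sequence: nucleotide sequence written 5' to 3'.
    marker_pos: 1-based position of the polymorphic base.
    span_len: length of the contiguous span (for example 8, 10, 12, ...).
    """
    if not 1 <= marker_pos <= len(sequence):
        raise ValueError("marker position lies outside the sequence")
    if not 1 <= span_len <= len(sequence):
        raise ValueError("span length is not consistent with the sequence length")
    marker_index = marker_pos - 1
    first_start = max(0, marker_index - span_len + 1)
    last_start = min(marker_index, len(sequence) - span_len)
    return [sequence[s:s + span_len] for s in range(first_start, last_start + 1)]


# Toy 7-nucleotide sequence with the polymorphic base (T) at position 4.
print(spans_containing_marker("ACGTACG", 4, 3))  # ['CGT', 'GTA', 'TAC']
```

The length check mirrors the proviso that a span length must be consistent with the length of the particular sequence.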

The invention also relates to polynucleotides that hybridize, under conditions of high 
or intermediate stringency, to a polynucleotide of a sequence from any of SEQ ID Nos. 1 to 
3908, 1 to 2260, 2261 to 3734, and 3735 to 3908 as well as sequences which are 
complementary thereto. Preferably such polynucleotides are at least 8, 10, 12, 15, 18, 19, 20, 
22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the extent that a 
polynucleotide of these lengths is consistent with the lengths of the particular Sequence ID. 
Preferred polynucleotides comprise a map-related biallelic marker. Optionally either the 1st 
or the 2nd allele of the biallelic markers disclosed in the SEQ ID No. may be specified as 
being present at the map-related biallelic marker. Conditions of high and intermediate 
stringency are further described in III.C.4. 

The primers of the present invention may be designed from the disclosed sequences 
using any method known in the art. A preferred set of primers is fashioned such that the 3' 
end of the contiguous span of identity with the sequences of the Sequence Listing is present at 
the 3' end of the primer. Such a configuration allows the 3' end of the primer to hybridize to a 
selected nucleic acid sequence and dramatically increases the efficiency of the primer for 
amplification or sequencing reactions. 

In a preferred set of primers the contiguous span is found in one of the sequences
described in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842,
3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599,
and 11600 to 11773 or the complements thereof. The invention also relates to polynucleotides
consisting of, consisting essentially of, or comprising a contiguous span of nucleotides of a
sequence from SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to
11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, as well as sequences which are
complementary thereto, wherein the "contiguous span" may be at least 8, 10, 12, 15, 18, 19,
20, or 21 nucleotides in length, to the extent that a contiguous span of these lengths is
consistent with the lengths of the particular Sequence ID No.

Allele specific primers may be designed such that a biallelic marker is at the 3' end of
the contiguous span and the contiguous span is present at the 3' end of the primer. Such allele
specific primers tend to selectively prime an amplification or sequencing reaction so long as
they are used with a nucleic acid sample that contains one of the two alleles present at a
biallelic marker. The 3' end of a primer of the invention may be located within, or at least 2, 4,
6, 8, or 10 nucleotides upstream of (to the extent that this distance is consistent with the
particular Sequence ID), a map-related biallelic marker in said sequence, or at any other
location which is appropriate for its intended use in sequencing, amplification or the
location of novel sequences or markers. Primers with their 3' ends located 1 nucleotide
upstream of a map-related biallelic marker have a special utility in microsequencing assays.
Preferred microsequencing primers are described in SEQ ID Nos. 1 to 3908, 1 to 2260, 2261
to 3374, and 3735 to 3908, where for each of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374,
and 3735 to 3908, the sense microsequencing primer contains the complement of the 19
nucleotides having their 3' ends located 1 nucleotide upstream of the polymorphic base of the
respective SEQ ID No, and where the antisense microsequencing primer contains the
complement of the 19 nucleotides of the complementary strand, nucleotides of the primer
having their 3' end located 1 nucleotide upstream of the polymorphic base on the
complementary strand to the respective SEQ ID No. The most preferred of said
microsequencing primers for each of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and
3735 to 3908 are microsequencing primers indicated as "A" or "S" in Table 1, which have
been validated in microsequencing experiments.
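
The placement of microsequencing primers relative to the polymorphic base can be illustrated with a short sketch. It is not the procedure used to generate the primers of Table 1. It assumes one reading of the paragraph above: the sense primer is taken as the 19 bases immediately 5' of the polymorphic base on the listed strand, and the antisense primer as the reverse complement of the 19 bases immediately 3' of it. The function names and the 0-based `snp_index` parameter are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical and not part of the specification.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def microsequencing_primers(seq: str, snp_index: int, length: int = 19):
    """Return (sense, antisense) primers whose 3' ends abut the polymorphic base.

    Assumption: `snp_index` is the 0-based position of the polymorphic base in
    `seq`. The sense primer is the `length` bases immediately 5' of that base;
    the antisense primer is the reverse complement of the `length` bases
    immediately 3' of it.
    """
    if snp_index - length < 0 or snp_index + 1 + length > len(seq):
        raise ValueError("sequence too short for primers of this length")
    sense = seq[snp_index - length:snp_index]
    antisense = reverse_complement(seq[snp_index + 1:snp_index + 1 + length])
    return sense, antisense
```

In this sketch both primers end one nucleotide short of the polymorphic base, so a single-base extension would incorporate the nucleotide corresponding to the allele present in the sample.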

The probes of the present invention may be designed from the disclosed sequences for
any method known in the art, particularly methods which allow for testing if a particular
sequence or marker disclosed herein is present. A preferred set of probes may be designed for
use in the hybridization assays of the invention in any manner known in the art such that they
selectively bind to one allele of a biallelic marker, but not the other under any particular set of
assay conditions. Preferred hybridization probes may consist of, consist essentially of, or
comprise a contiguous span of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to
3908, or the complement thereof, which ranges in length from at least 8, 10, 12, 15, 18, 19, 20,
22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides, to the extent that a contiguous span of
these lengths is consistent with the lengths of the particular Sequence ID No., or be specified
as being 12, 15, 18, 19, 20, 25, 35, 40, 43, 44, 45, 46 or 47 nucleotides in length and including
the map-related biallelic marker of said sequence. Optionally the 1st allele or 2nd allele of
SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 may be specified as being
present at the biallelic marker site. Optionally, said biallelic marker may be within 6, 5, 4, 3,
2, or 1 nucleotides of the center of the hybridization probe or at the center of said probe.
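
As an illustration of the probe geometry described above, the following sketch builds an allele-specific probe of a chosen length with the biallelic marker at, or within a few nucleotides of, the probe center. It is a minimal sketch under stated assumptions: the function name, the 0-based `snp_index`, and the `offset` parameter are hypothetical, and no hybridization behaviour (melting temperature, stringency) is modelled.

```python
def centered_probe(seq: str, snp_index: int, allele: str,
                   length: int = 25, offset: int = 0) -> str:
    """Return a probe of `length` bases carrying `allele` at the marker site.

    Assumptions: `snp_index` is the 0-based marker position in `seq`;
    `offset` shifts the marker away from the probe center and is limited
    to +/-6 bases, following the range given in the text.
    """
    if len(allele) != 1:
        raise ValueError("allele must be a single base")
    if abs(offset) > 6:
        raise ValueError("marker must lie within 6 bases of the probe center")
    marker_pos = length // 2 + offset      # marker index inside the probe
    start = snp_index - marker_pos         # probe start inside seq
    end = start + length
    if start < 0 or end > len(seq):
        raise ValueError("probe does not fit within the sequence")
    # Substitute the requested allele at the marker position.
    return seq[start:snp_index] + allele + seq[snp_index + 1:end]
```

Calling the function once per allele yields a pair of probes that differ only at the polymorphic base, which is the usual starting point for an allele-discriminating hybridization assay.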

Any of the polynucleotides of the present invention can be labeled, if desired, by
incorporating a label detectable by spectroscopic, photochemical, biochemical,
immunochemical, or chemical means. For example, useful labels include radioactive
substances, fluorescent dyes or biotin. Preferably, polynucleotides are labeled at their 3' and
5' ends. A label can also be used to capture the primer, so as to facilitate the immobilization of
either the primer or a primer extension product, such as amplified DNA, on a solid support. A
capture label is attached to the primers or probes and can be a specific binding member which
forms a binding pair with the solid phase reagent's specific binding member (e.g. biotin and
streptavidin). Therefore, depending upon the type of label carried by a polynucleotide or a
probe, it may be employed to capture or to detect the target DNA. Further, it will be
understood that the polynucleotides, primers or probes provided herein may, themselves,
serve as the capture label. For example, in the case where a solid phase reagent's binding
member is a nucleic acid sequence, it may be selected such that it binds a complementary
portion of a primer or probe to thereby immobilize the primer or probe to the solid phase. In
cases where a polynucleotide probe itself serves as the binding member, those skilled in the art
will recognize that the probe will contain a sequence or "tail" that is not complementary to the
target. In the case where a polynucleotide primer itself serves as the capture label, at least a
portion of the primer will be free to hybridize with a nucleic acid on a solid phase. DNA
labeling techniques are well known to the skilled technician.

Any of the polynucleotides, primers and probes of the present invention can be
conveniently immobilized on a solid support. Solid supports are known to those skilled in the
art and include the walls of wells of a reaction tray, test tubes, polystyrene beads, magnetic
beads, nitrocellulose strips, membranes, microparticles such as latex particles, sheep (or other
animal) red blood cells, duracytes® and others. The solid support is not critical and can be
selected by one skilled in the art. Thus, latex particles, microparticles, magnetic or non-magnetic
beads, membranes, plastic tubes, walls of microtiter wells, glass or silicon chips,
sheep (or other suitable animal's) red blood cells and duracytes are all suitable examples.
Suitable methods for immobilizing nucleic acids on solid phases include ionic, hydrophobic,
covalent interactions and the like. A solid support, as used herein, refers to any material
which is insoluble, or can be made insoluble by a subsequent reaction. The solid support can
be chosen for its intrinsic ability to attract and immobilize the capture reagent. Alternatively,
the solid phase can retain an additional receptor which has the ability to attract and
immobilize the capture reagent. The additional receptor can include a charged substance that
is oppositely charged with respect to the capture reagent itself or to a charged substance
conjugated to the capture reagent. As yet another alternative, the receptor molecule can be
any specific binding member which is immobilized upon (attached to) the solid support and
which has the ability to immobilize the capture reagent through a specific binding reaction.
The receptor molecule enables the indirect binding of the capture reagent to a solid support
material before the performance of the assay or during the performance of the assay. The
solid phase thus can be a plastic, derivatized plastic, magnetic or non-magnetic metal, glass or
silicon surface of a test tube, microtiter well, sheet, bead, microparticle, chip, sheep (or other
suitable animal's) red blood cells, duracytes® and other configurations known to those of
ordinary skill in the art. The polynucleotides of the invention can be attached to or
immobilized on a solid support individually or in groups of at least 2, 5, 8, 10, 12, 15, 20, or
25 distinct polynucleotides of the invention to a single solid support. In addition,
polynucleotides other than those of the invention may be attached to the same solid support as
one or more polynucleotides of the invention.

Any polynucleotide provided herein may be attached in overlapping areas or at
random locations on the solid support. Alternatively the polynucleotides of the invention may
be attached in an ordered array wherein each polynucleotide is attached to a distinct region of
the solid support which does not overlap with the attachment site of any other polynucleotide.
Preferably, such an ordered array of polynucleotides is designed to be "addressable" where the
distinct locations are recorded and can be accessed as part of an assay procedure. Addressable
polynucleotide arrays typically comprise a plurality of different oligonucleotide probes that
are coupled to a surface of a substrate in different known locations. The knowledge of the
precise location of each polynucleotide makes these "addressable" arrays
particularly useful in hybridization assays. Any addressable array technology known in the art
can be employed with the polynucleotides of the invention. One particular embodiment of
these polynucleotide arrays is known as the Genechips™, and has been generally described in
US Patent 5,143,854; PCT publications WO 90/15070 and 92/10092. These arrays may
generally be produced using mechanical synthesis methods or light directed synthesis
methods, which incorporate a combination of photolithographic methods and solid phase
oligonucleotide synthesis (Fodor et al., Science, 251:767-777, 1991). The immobilization of
arrays of oligonucleotides on solid supports has been rendered possible by the development of
a technology generally identified as "Very Large Scale Immobilized Polymer Synthesis"
(VLSIPS™) in which, typically, probes are immobilized in a high density array on a solid
surface of a chip. Examples of VLSIPS™ technologies are provided in US Patents 5,143,854
and 5,412,087 and in PCT Publications WO 90/15070, WO 92/10092 and WO 95/11995,
which describe methods for forming oligonucleotide arrays through techniques such as light-directed
synthesis techniques. In designing strategies aimed at providing arrays of nucleotides
immobilized on solid supports, further presentation strategies were developed to order and
display the oligonucleotide arrays on the chips in an attempt to maximize hybridization
patterns and sequence information. Examples of such presentation strategies are disclosed in
PCT Publications WO 94/12305, WO 94/11530, WO 97/29212 and WO 97/31256.
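
The notion of an "addressable" array, in which each polynucleotide occupies a distinct, recorded, non-overlapping location that can be looked up during an assay, can be sketched as a simple data structure. This is an illustrative model only and does not describe any particular commercial array format; the class name, method names and the intensity threshold are hypothetical.

```python
class AddressableArray:
    """Toy model of an ordered array: one probe per recorded (row, col) address."""

    def __init__(self, rows: int, cols: int):
        self.rows = rows
        self.cols = cols
        self._probes = {}  # (row, col) -> (probe name, probe sequence)

    def attach(self, row: int, col: int, name: str, sequence: str) -> None:
        """Record a probe at a distinct address; overlapping sites are rejected."""
        if not (0 <= row < self.rows and 0 <= col < self.cols):
            raise IndexError("address lies outside the array")
        if (row, col) in self._probes:
            raise ValueError("address already occupied by another probe")
        self._probes[(row, col)] = (name, sequence)

    def probe_at(self, row: int, col: int):
        """Look up which probe sits at a given address."""
        return self._probes[(row, col)]

    def positive_probes(self, signals: dict, threshold: float) -> list:
        """Return names of probes whose measured signal meets the threshold.

        `signals` maps (row, col) addresses to intensities; the threshold is
        an assumed, assay-specific cut-off.
        """
        return sorted(
            self._probes[addr][0]
            for addr, value in signals.items()
            if addr in self._probes and value >= threshold
        )
```

Because the address of every probe is known in advance, a hybridization signal at a given location can be translated directly into the identity of the probe, and hence the allele, that produced it.
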
Oligonucleotide arrays may comprise at least one of the sequences selected
from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and
3735 to 3908 and the sequences complementary thereto, or a fragment thereof of at
least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 consecutive
nucleotides, to the extent that fragments of these lengths are consistent with the lengths
of the particular Sequence ID, for determining whether a sample contains one or more
alleles of the biallelic markers of the present invention. Oligonucleotide arrays may
also comprise at least one of the sequences selected from the group consisting of SEQ
ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908, and the sequences
complementary thereto, or a fragment thereof of at least 8, 10, 12, 15, 18, 19, 20, 22,
23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 consecutive nucleotides, to the extent that
fragments of these lengths are consistent with the lengths of the particular Sequence ID,
for amplifying one or more alleles of the biallelic markers of SEQ ID Nos. 1 to 3908,
1 to 2260, 2261 to 3374, and 3735 to 3908. In other embodiments, arrays may also
comprise at least one of the sequences selected from the group consisting of SEQ ID
Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and the sequences
complementary thereto, or a fragment thereof of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24,
25, 30, 35, 43, 44, 45, 46 or 47 consecutive nucleotides, to the extent that fragments of
these lengths are consistent with the lengths of the particular Sequence ID, for
conducting microsequencing analyses to determine whether a sample contains one or
more alleles of the biallelic markers of the invention. In still further embodiments, the
oligonucleotide array may comprise at least one of the sequences selected from the
group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908
and the sequences complementary thereto, or a fragment thereof of at least 8, 10, 12,
15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44, 45, 46 or 47 nucleotides in length, to the
extent that fragments of these lengths are consistent with the lengths of the particular
Sequence ID, for determining whether a sample contains one or more alleles of the
biallelic markers of the present invention.

In designing strategies aimed at providing arrays of nucleotides immobilized on solid
supports, further presentation strategies were developed to order and display the probe arrays
on the chips in an attempt to maximize hybridization patterns and sequence information.
Examples of such presentation strategies are disclosed in PCT Publications WO 94/12305,
WO 94/11530, WO 97/29212 and WO 97/31256.

Each DNA chip can contain thousands to millions of individual synthetic DNA probes
arranged in a grid-like pattern and miniaturized to the size of a dime. In some embodiments,
the efficiency of hybridization of nucleic acids in the sample with the probes attached to the
chip may be improved by using polyacrylamide gel pads isolated from one another by
hydrophobic regions in which the DNA probes are covalently linked to an acrylamide matrix.

The polymorphic bases present in the biallelic marker or markers of the sample 
nucleic acids are determined as follows. Probes which contain at least a portion of one or 
more of the biallelic markers of the present invention are synthesized either in situ or by 
conventional synthesis and immobilized on an appropriate chip using methods known to the 
skilled technician. 

Any one or more alleles of the biallelic markers described herein (SEQ ID Nos. 1 to
3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto) or
fragments thereof containing the polymorphic bases, may be fixed to a solid support, such as a
microchip or other immobilizing surface. The fragments of these nucleic acids may comprise at
least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides of the
biallelic markers described herein. Preferably, the fragments include the polymorphic bases of
the biallelic markers.

A nucleic acid sample is applied to the immobilizing surface and analyzed to determine
the identities of the polymorphic bases of one or more of the biallelic markers. In some
embodiments, the solid support may also include one or more of the amplification primers
described herein, or fragments comprising at least 10, at least 15, or at least 20 consecutive
nucleotides thereof, for generating an amplification product containing the polymorphic bases of
the biallelic markers to be analyzed in the sample.

Another embodiment of the present invention is a solid support which includes one or
more of the microsequencing primers of the invention, or fragments comprising at least 10, at
least 15, or at least 20 consecutive nucleotides thereof and having a 3' terminus immediately
upstream of the polymorphic base of the corresponding biallelic marker, for determining the
identity of the polymorphic base of the one or more biallelic markers fixed to the solid support.

For example, one embodiment of the present invention is an array of nucleic acids fixed
to a solid support, such as a microchip, bead, or other immobilizing surface, comprising one or
more of the biallelic markers in the maps of the present invention or a fragment comprising at
least 10, at least 15, at least 20, at least 25, or more than 25 consecutive nucleotides thereof
including the polymorphic base. For example, the array may comprise 1, 5, 10, 20, 50, 100,
200, 500, 1000, 2000, or 3000 of the biallelic markers selected from the group consisting of
SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences
complementary thereto, or a fragment comprising at least 10, at least 15, at least 20, at least 25,
or more than 25 consecutive nucleotides thereof including the polymorphic base.

Another embodiment of the present invention is an array comprising amplification
primers for generating amplification products containing the polymorphic bases of one or more,
at least five, at least 10, at least 20, at least 100, at least 200, at least 300, at least 400, or more
than 400 of the biallelic markers in the maps of the present invention. For example, the array
may comprise amplification primers for generating amplification products containing the
polymorphic bases of at least 1, 5, 10, 20, 50, 100, 200, 300, 400, 500, 1000, 2000, or 3000 of
the biallelic markers selected from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260,
2261 to 3374, 3735 to 3908 or the sequences complementary thereto. In such arrays, the
amplification primers included in the array are capable of amplifying the biallelic marker
sequences to be detected in the nucleic acid sample applied to the array (i.e. the amplification
primers correspond to the biallelic markers affixed to the array - see Table 1). Thus, the arrays
may include one or more of the amplification primers of SEQ ID Nos. 3935 to 7842, 7866 to
11773, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 10125, 10126 to 11599, and 11600
to 11773 corresponding to the one or more biallelic markers of SEQ ID Nos. 1 to 3908, 1 to
2260, 2261 to 3374, and 3735 to 3908 which are included in the array.

Another embodiment of the present invention is an array which includes
microsequencing primers capable of determining the identity of the polymorphic bases of at
least 1, 5, 10, 20, 50, 100, 200, 300, 500, 1000, 2000, or 3000 of the biallelic markers of the
present invention. For example, the array may comprise microsequencing primers capable of
determining the identity of
the polymorphic bases of one or more, at least five, at least 10, at least 20, at least 100, at least
200, at least 300, at least 400, or more than 400 of the biallelic markers of SEQ ID Nos. 1 to
3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.

Arrays containing any combination of the above nucleic acids which permits the
specific detection or identification of the polymorphic bases of the biallelic markers in the
maps of the present invention, including any combination of biallelic markers of SEQ ID Nos. 1
to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto are
also within the scope of the present invention. For example, the array may comprise both the
biallelic markers and amplification primers capable of generating amplification products
containing the polymorphic bases of the biallelic markers. Alternatively, the array may
comprise both amplification primers capable of generating amplification products containing
the polymorphic bases of the biallelic markers and microsequencing primers capable of
determining the identities of the polymorphic bases of these markers.

Although the above examples describe arrays comprising specific groups of biallelic
markers and, in some embodiments, specific amplification primers and microsequencing
primers, it will be appreciated that the present invention encompasses arrays including any
biallelic marker, group of biallelic markers, amplification primer, group of amplification
primers, microsequencing primer, or group of microsequencing primers described herein, as
well as any combination of the preceding nucleic acids.

The present invention also encompasses diagnostic kits comprising one or more
polynucleotides of the invention, optionally with a portion or all of the necessary reagents and
instructions for genotyping a test subject by determining the identity of a nucleotide at a map-related
biallelic marker. The polynucleotides of a kit may optionally be attached to a solid
support, or be part of an array or addressable array of polynucleotides. The kit may provide
for the determination of the identity of the nucleotide at a marker position by any method
known in the art including, but not limited to, a sequencing assay method, a microsequencing
assay method, a hybridization assay method, or an allele specific amplification method.
Optionally such a kit may include instructions for scoring the results of the determination with
respect to the test subject's risk of contracting a disease, likely response to
an agent acting on a disease, or chances of suffering from side effects to an agent acting on a
disease.

II. Methods For De Novo Identification Of Biallelic Markers

Any of a variety of methods can be used to screen a genomic fragment for single
nucleotide polymorphisms such as differential hybridization with oligonucleotide probes,
detection of changes in the mobility measured by gel electrophoresis or direct sequencing of
the amplified nucleic acid. A preferred method for identifying biallelic markers involves
comparative sequencing of genomic DNA fragments from an appropriate number of unrelated
individuals.

In a first embodiment, DNA samples from unrelated individuals are pooled together,
following which the genomic DNA of interest is amplified and sequenced. The nucleotide
sequences thus obtained are then analyzed to identify significant polymorphisms. One of the
major advantages of this method resides in the fact that the pooling of the DNA samples
substantially reduces the number of DNA amplification reactions and sequencing reactions
which must be carried out. Moreover, this method is sufficiently sensitive so that a biallelic
marker obtained thereby usually demonstrates a sufficient frequency of its less common allele
to be useful in conducting association studies. Usually, the frequency of the least common
allele of a biallelic marker identified by this method is at least 10%.

In a second embodiment, the DNA samples are not pooled and are therefore amplified and
sequenced individually. This method is usually preferred when biallelic markers need to be
identified in order to perform association studies within candidate genes. Preferably, highly
relevant gene regions such as promoter regions or exon regions may be screened for biallelic
markers. A biallelic marker obtained using this method may show a lower degree of
informativeness for conducting association studies, e.g. if the frequency of its less frequent
allele is less than about 10%. Such a biallelic marker will however be sufficiently
informative to conduct association studies, and it will further be appreciated that including less
informative biallelic markers in the genetic analysis studies of the present invention may
allow in some cases the direct identification of causal mutations, which may, depending on
their penetrance, be rare mutations.

The following is a description of the various parameters of a preferred method used by the 
inventors for the identification of the biallelic markers of the present invention. 
II.A. Genomic DNA samples

The genomic DNA samples from which the biallelic markers of the present invention
are generated are preferably obtained from unrelated individuals corresponding to a
heterogeneous population of known ethnic background. The number of individuals from
whom DNA samples are obtained can vary substantially, preferably from about 10 to about
1000, more preferably from about 50 to about 200 individuals. Usually, DNA samples are
collected from at least about 100 individuals in order to have sufficient polymorphic diversity
in a given population to identify as many markers as possible and to generate statistically
significant results.

As for the source of the genomic DNA to be subjected to analysis, any test sample can
be foreseen without any particular limitation. These test samples include biological samples,
which can be tested by the methods of the present invention described herein, and include
human and animal body fluids such as whole blood, serum, plasma, cerebrospinal fluid, urine,
lymph fluids, and various external secretions of the respiratory, intestinal and genitourinary
tracts, tears, saliva, milk, white blood cells, myelomas and the like; biological fluids such as
cell culture supernatants; fixed tissue specimens including tumor and non-tumor tissue and
lymph node tissues; bone marrow aspirates and fixed cell specimens. The preferred source of
genomic DNA used in the present invention is from peripheral venous blood of each donor.
Techniques to prepare genomic DNA from biological samples are well known to the skilled
technician. Details of a preferred embodiment are provided in Example 27. The person
skilled in the art can choose to amplify pooled or unpooled DNA samples.

II.B. DNA Amplification

The identification of biallelic markers in a sample of genomic DNA may be facilitated
through the use of DNA amplification methods. DNA samples can be pooled or unpooled for
the amplification step. DNA amplification techniques are well known to those skilled in the
art. Various methods to amplify DNA fragments carrying biallelic markers are further
described hereinafter in III.B. The PCR technology is the preferred amplification technique
used to identify new biallelic markers.

In a first embodiment, biallelic markers are identified using genomic sequence
information generated by the inventors. Genomic DNA fragments, such as the inserts of the
BAC clones described above, are sequenced and used to design primers for the amplification
of 500 bp fragments. These 500 bp fragments are amplified from genomic DNA and are
scanned for biallelic markers. Primers may be designed using the OSP software (Hillier L.
and Green P., 1991). All primers may contain, upstream of the specific target bases, a
common oligonucleotide tail that serves as a sequencing primer. Those skilled in the art are
familiar with primer extensions, which can be used for these purposes.

In another embodiment of the invention, genomic sequences of candidate genes are
available in public databases allowing direct screening for biallelic markers. Preferred
primers, useful for the amplification of genomic sequences encoding the candidate genes,
focus on promoters, exons and splice sites of the genes. A biallelic marker present in these
functional regions of the gene has a higher probability of being a causal mutation.

Preferred primers include those disclosed in SEQ ID Nos. 3935 to 7842, 3935 to 6194,
6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to
11773.

II.C. Sequencing Of Amplified Genomic DNA And Identification Of Single Nucleotide
Polymorphisms

The amplification products generated as described above are then sequenced using
any method known and available to the skilled technician. Methods for sequencing DNA
using either the dideoxy-mediated method (Sanger method) or the Maxam-Gilbert method are
widely known to those of ordinary skill in the art. Such methods are for example disclosed in
Maniatis et al. (Molecular Cloning, A Laboratory Manual, Cold Spring Harbor Press, Second
Edition, 1989). Alternative approaches include hybridization to high-density DNA probe
arrays as described in Chee et al. (Science 274, 610, 1996).

Preferably, the amplified DNA is subjected to automated dideoxy terminator
sequencing reactions using a dye-primer cycle sequencing protocol. The products of the
sequencing reactions are run on sequencing gels and the sequences are determined using gel
image analysis. The polymorphism search is based on the presence of superimposed peaks in
the electrophoresis pattern resulting from different bases occurring at the same position.
Because each dideoxy terminator is labeled with a different fluorescent molecule, the two
peaks corresponding to a biallelic site present distinct colors corresponding to two different
nucleotides at the same position on the sequence. However, the presence of two peaks can be
an artifact due to background noise. To exclude such an artifact, the two DNA strands are
sequenced and a comparison between the peaks is carried out. In order to be registered as a
polymorphic sequence, the polymorphism has to be detected on both strands.
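
The two-strand confirmation rule described above can be sketched as follows. This is a simplified illustration, not the gel image analysis actually used. It assumes that peak intensities for each position are already available as a mapping from base to intensity, that the reverse-strand read has already been converted to forward-strand coordinates and base names, and that a secondary peak counts as "superimposed" when it reaches a hypothetical fraction `min_ratio` of the primary peak.

```python
def candidate_sites(peaks, min_ratio=0.3):
    """Return {position: frozenset of two bases} where two peaks are superimposed.

    `peaks` is a list with one dict per position mapping base -> intensity.
    `min_ratio` is an assumed threshold for the secondary/primary peak ratio.
    """
    sites = {}
    for pos, intensities in enumerate(peaks):
        ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) < 2 or ranked[0][1] <= 0:
            continue  # a single peak (or no signal) cannot be a biallelic site
        if ranked[1][1] / ranked[0][1] >= min_ratio:
            sites[pos] = frozenset((ranked[0][0], ranked[1][0]))
    return sites


def confirmed_sites(forward_peaks, reverse_peaks, min_ratio=0.3):
    """Keep only candidate sites seen with the same two bases on both strands."""
    fwd = candidate_sites(forward_peaks, min_ratio)
    rev = candidate_sites(reverse_peaks, min_ratio)
    return {pos: bases for pos, bases in fwd.items() if rev.get(pos) == bases}
```

A position showing two peaks on one strand only is discarded as probable background noise, which mirrors the registration criterion stated in the text.
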

The above procedure permits those amplification products which contain biallelic
markers to be identified. The detection limit for the frequency of biallelic polymorphisms
detected by sequencing pools of 100 individuals is approximately 0.1 for the minor allele, as
verified by sequencing pools of known allelic frequencies. However, more than 90% of the
biallelic polymorphisms detected by the pooling method have a frequency for the minor allele
higher than 0.25. Therefore, the biallelic markers selected by this method have a frequency of
at least 0.1 for the minor allele and less than 0.9 for the major allele, preferably at least 0.2
for the minor allele and less than 0.8 for the major allele, more preferably at least 0.3 for the
minor allele and less than 0.7 for the major allele, and thus a heterozygosity rate higher than 0.18,
preferably higher than 0.32, more preferably higher than 0.42.
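
The heterozygosity rates quoted above (0.18, 0.32 and 0.42) are consistent with the expected heterozygosity 2p(1 - p) of a biallelic marker with minor allele frequency p. The sketch below reproduces these figures under the assumption of Hardy-Weinberg proportions, which the text does not state explicitly; the function name is hypothetical.

```python
def expected_heterozygosity(minor_allele_freq: float) -> float:
    """Expected heterozygosity 2p(1 - p) for a biallelic marker.

    Assumes Hardy-Weinberg proportions; p is the minor allele frequency,
    so it must lie between 0 and 0.5.
    """
    p = minor_allele_freq
    if not 0.0 <= p <= 0.5:
        raise ValueError("minor allele frequency must be between 0 and 0.5")
    return 2.0 * p * (1.0 - p)


if __name__ == "__main__":
    for p in (0.1, 0.2, 0.3):
        print(f"minor allele frequency {p:.1f} -> heterozygosity {expected_heterozygosity(p):.2f}")
```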

In another embodiment, biallelic markers are detected by sequencing individual DNA
samples; the frequency of the minor allele of such a biallelic marker may be less than 0.1.

The markers carried by the same fragment of genomic DNA, such as the insert in a
BAC clone, need not necessarily be ordered with respect to one another within the genomic
fragment to conduct association studies. However, in some embodiments of the present
invention, the order of biallelic markers carried by the same fragment of genomic DNA is
determined.

II.D. Validation of the biallelic markers of the present invention

The polymorphisms are evaluated for their usefulness as genetic markers by validating 
20 that both alleles are present in a population. Validation of the biallelic markers is 

accomplished by genotyping a group of individuals by a method of the invention and 
demonstrating that both alleles are present. Microsequencing is a preferred method of 
genotyping alleles. The validation by genotyping step may be performed on individual 
samples derived from each individual in the group or by genotyping a pooled sample derived 
25 from more than one individual. The group can be as small as one individual if that individual 
is heterozygous for the allele in question. Preferably the group contains at least three 
individuals, more preferably the group contains five or six individuals, so that a single 
validation test will be more likely to result in the validation of more of the biallelic markers 
that are being tested. It should be noted, however, that when the validation test is performed
on a small group, it may give a false negative result if, because of sampling error, none of
the individuals tested carries one of the two alleles. Thus, the validation process is less useful
for demonstrating that a particular initial result is an artifact than it is for demonstrating that
there is a bona fide biallelic marker at a particular position in a sequence. All of the
genotyping, haplotyping, association, and interaction study methods of the invention may
optionally be performed solely with validated biallelic markers.
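
The sampling-error caveat above can be made concrete. The sketch below assumes that the 2n chromosomes of n diploid individuals are independent draws from the population; this simplification and the function name are illustrative only. It estimates how often a small validation group would miss the minor allele altogether.

```python
def prob_minor_allele_missed(minor_allele_freq: float, n_individuals: int) -> float:
    """Probability that none of n diploid individuals carries the minor allele.

    Assumes the 2n sampled chromosomes are independent draws (illustrative
    simplification; relatedness and population structure are ignored).
    """
    if not 0.0 <= minor_allele_freq <= 1.0:
        raise ValueError("frequency must lie in [0, 1]")
    if n_individuals < 1:
        raise ValueError("at least one individual is required")
    return (1.0 - minor_allele_freq) ** (2 * n_individuals)


if __name__ == "__main__":
    for n in (1, 3, 6):
        risk = prob_minor_allele_missed(0.1, n)
        print(f"{n} individual(s), minor allele frequency 0.1: "
              f"false negative probability {risk:.2f}")
```

The false negative probability falls as the group grows, which is consistent with the preference for groups of at least three, and more preferably five or six, individuals.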

II.E. Evaluation of the frequency of the biallelic markers of the present invention

The validated biallelic markers are further evaluated for their usefulness as genetic
markers by determining the frequency of the least common allele at the biallelic marker site. 
The determination of the least common allele is accomplished by genotyping a group of 
individuals by a method of the invention and demonstrating that both alleles are present. This 
determination of frequency by genotyping step may be performed on individual samples
derived from each individual in the group or by genotyping a pooled sample derived from 
more than one individual. The group must be large enough to be representative of the 
population as a whole. Preferably the group contains at least 20 individuals, more preferably 
the group contains at least 50 individuals, most preferably the group contains at least 100 
individuals. Of course the larger the group the greater the accuracy of the frequency
determination because of reduced sampling error. A biallelic marker wherein the frequency of
the less common allele is 30% or more is termed a "high quality biallelic marker." All of the 
genotyping, haplotyping, association, and interaction study methods of the invention may 
optionally be performed solely with high quality biallelic markers. 
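
As an illustration of the frequency determination described above, the following sketch estimates the frequency of the less common allele from individual genotypes and reports a binomial standard error, which shrinks as the group grows. The genotype encoding (two-letter strings), the function name and the use of a simple binomial standard error are assumptions made for this example.

```python
import math
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor allele, frequency, standard error) for a biallelic marker.

    genotypes: iterable of two-character strings such as "AG" (hypothetical
    encoding, one string per individual).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) != 2:
        raise ValueError("expected exactly two alleles in the group")
    total = sum(counts.values())                      # number of chromosomes
    minor, minor_count = min(counts.items(), key=lambda item: item[1])
    freq = minor_count / total
    stderr = math.sqrt(freq * (1.0 - freq) / total)   # binomial standard error
    return minor, freq, stderr


if __name__ == "__main__":
    group = ["AA"] * 10 + ["AG"] * 8 + ["GG"] * 2     # 20 individuals
    allele, freq, stderr = minor_allele_frequency(group)
    label = "high quality" if freq >= 0.30 else "not high quality"
    print(f"minor allele {allele}: {freq:.2f} +/- {stderr:.2f} ({label})")
```

The 30% cut-off used for the label is the "high quality biallelic marker" threshold given in the text.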

III. Method Of Genotyping An Individual For Biallelic Markers

Methods are provided to genotype a biological sample for one or more biallelic 
markers of the present invention, all of which may be performed in vitro. Such methods of 
genotyping comprise determining the identity of a nucleotide at a map-related biallelic marker 
by any method known in the art. These methods find use in genotyping case-control 
populations in association studies as well as individuals in the context of detection of alleles
of biallelic markers which are known to be associated with a given trait, in which case both
copies of the biallelic marker present in the individual's genome are determined so that an
individual may be classified as homozygous or heterozygous for a particular allele.

These genotyping methods can be performed on nucleic acid samples derived from a
single individual or on pooled DNA samples.

Genotyping can be performed using similar methods as those described above for the 
identification of the biallelic markers, or using other genotyping methods such as those further 
described below. In preferred embodiments, the comparison of sequences of amplified 
genomic fragments from different individuals is used to identify new biallelic markers 
whereas microsequencing is used for genotyping known biallelic markers in diagnostic and
association study applications. 

III.A. Source of DNA for genotyping

Any source of nucleic acids, in purified or non-purified form, can be utilized as the 
starting nucleic acid, provided it contains or is suspected of containing the specific nucleic 
acid sequence desired. DNA or RNA may be extracted from cells, tissues, body fluids and the
like as described above in II.A. While nucleic acids for use in the genotyping methods of the
invention can be derived from any mammalian source, the test subjects and individuals from
which nucleic acid samples are taken are generally understood to be human. 

III.B. Amplification Of DNA Fragments Comprising Biallelic Markers

Methods and polynucleotides are provided to amplify a segment of nucleotides 
comprising one or more biallelic markers of the present invention. It will be appreciated that
amplification of DNA fragments comprising biallelic markers may be used in various methods 
and for various purposes and is not restricted to genotyping. Nevertheless, many genotyping 
methods, although not all, require the previous amplification of the DNA region carrying the 
biallelic marker of interest. Such methods specifically increase the concentration or total 
number of sequences that span the biallelic marker or include that site and sequences located
either distal or proximal to it. Diagnostic assays may also rely on amplification of DNA 
segments carrying a biallelic marker of the present invention. 

Amplification of DNA may be achieved by any method known in the art, such as the
established PCR (polymerase chain reaction) method, or by developments thereof or
alternatives. Amplification methods which can be utilized herein include but are not limited
to Ligase Chain Reaction (LCR) as described in EP A 320 308 and EP A 439 182, Gap LCR
(Wolcott, M.J., Clin. Microbiol. Rev. 5:370-386), the so-called "NASBA" or "3SR" technique
described in Guatelli J.C. et al. (Proc. Natl. Acad. Sci. USA 87:1874-1878, 1990) and in
Compton J. (Nature 350:91-92, 1991), Q-beta amplification as described in European Patent
Application no 4544610, strand displacement amplification as described in Walker et al. (Clin.
Chem. 42:9-13, 1996) and EP A 684 315, and target mediated amplification as described in
PCT Publication WO 9322461.

LCR and Gap LCR are exponential amplification techniques; both depend on DNA
ligase to join adjacent primers annealed to a DNA molecule. In Ligase Chain Reaction (LCR),
probe pairs are used which include two primary (first and second) and two secondary (third
and fourth) probes, all of which are employed in molar excess to target. The first probe
hybridizes to a first segment of the target strand and the second probe hybridizes to a second
segment of the target strand, the first and second segments being contiguous so that the
primary probes abut one another in 5' phosphate-3' hydroxyl relationship, and so that a ligase
can covalently fuse or ligate the two probes into a fused product. In addition, a third
(secondary) probe can hybridize to a portion of the first probe and a fourth (secondary) probe
can hybridize to a portion of the second probe in a similar abutting fashion. Of course, if the
target is initially double stranded, the secondary probes also will hybridize to the target
complement in the first instance. Once the ligated strand of primary probes is separated from
the target strand, it will hybridize with the third and fourth probes which can be ligated to
form a complementary, secondary ligated product. It is important to realize that the ligated
products are functionally equivalent to either the target or its complement. By repeated cycles
of hybridization and ligation, amplification of the target sequence is achieved. A method for
multiplex LCR has also been described (WO 9320227). Gap LCR (GLCR) is a version of
LCR where the probes are not adjacent but are separated by 2 to 3 bases.

For amplification of mRNAs, it is within the scope of the present invention to reverse
transcribe mRNA into cDNA followed by polymerase chain reaction (RT-PCR); or, to use a
single enzyme for both steps as described in U.S. Patent No. 5,322,770; or, to use Asymmetric
Gap LCR (RT-AGLCR) as described by Marshall R.L. et al. (PCR Methods and Applications
4:80-84, 1994). AGLCR is a modification of GLCR that allows the amplification of RNA.
Some of these amplification methods are particularly suited for the detection of single
nucleotide polymorphisms and allow the simultaneous amplification of a target sequence and
the identification of the polymorphic nucleotide, as further described in III.C.

The PCR technology is the preferred amplification technique used in the present 
invention. A variety of PCR techniques are familiar to those skilled in the art. For a review of 
PCR technology, see Molecular Cloning to Genetic Engineering, White, B.A. Ed., in Methods in
Molecular Biology 67: Humana Press, Totowa (1997) and the publication entitled "PCR 
Methods and Applications" (1991, Cold Spring Harbor Laboratory Press). In each of these 
PCR procedures, PCR primers on either side of the nucleic acid sequences to be amplified are 
added to a suitably prepared nucleic acid sample along with dNTPs and a thermostable 
polymerase such as Taq polymerase, Pfu polymerase, or Vent polymerase. The nucleic acid in
the sample is denatured and the PCR primers are specifically hybridized to complementary 
nucleic acid sequences in the sample. The hybridized primers are extended. Thereafter, another 
cycle of denaturation, hybridization, and extension is initiated. The cycles are repeated multiple 
times to produce an amplified fragment containing the nucleic acid sequence between the primer 
sites. PCR has further been described in several patents including US Patents 4,683,195,
4,683,202 and 4,965,188. 
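
The outcome of the PCR cycles described above, a fragment containing the sequence between the two primer sites, can be sketched in silico. The example below is an idealised illustration that requires perfect primer matches and ignores reaction kinetics; the function names and the toy sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def in_silico_pcr(template: str, forward: str, reverse: str):
    """Return the top-strand amplicon bounded by the two primers, or None.

    template: top strand, 5'->3'.
    forward:  primer identical to a stretch of the top strand.
    reverse:  primer identical to a stretch of the bottom strand (5'->3').
    Only perfect matches are considered (idealised model).
    """
    start = template.find(forward)
    if start < 0:
        return None
    reverse_site = reverse_complement(reverse)   # where the reverse primer lies on the top strand
    end = template.find(reverse_site, start + len(forward))
    if end < 0:
        return None
    return template[start:end + len(reverse_site)]


if __name__ == "__main__":
    template = "TTGACCTAGGCATCGATCGGATTACAGGCTTAACG"   # toy sequence
    forward = "GACCTAGG"
    reverse = reverse_complement("GGCTTAAC")
    amplicon = in_silico_pcr(template, forward, reverse)
    print(amplicon, len(amplicon))
```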

The identification of biallelic markers as described above allows the design of 
appropriate oligonucleotides, which can be used as primers to amplify DNA fragments 
comprising the biallelic markers of the present invention. Amplification can be performed 
using the primers initially used to discover new biallelic markers which are described herein
or any set of primers allowing the amplification of a DNA fragment comprising a biallelic 
marker of the present invention. Primers can be prepared by any suitable method, for
example, direct chemical synthesis by a method such as the phosphotriester method of Narang
S.A. et al. (Methods Enzymol. 68:90-98, 1979), the phosphodiester method of Brown E.L. et
al. (Methods Enzymol. 68:109-151, 1979), the diethylphosphoramidite method of Beaucage et
al. (Tetrahedron Lett. 22:1859-1862, 1981) and the solid support method described in EP 0
707 592.

WO 99/54500 PCT/IB99/00822

In some embodiments the present invention provides primers for amplifying a DNA 
fragment containing one or more biallelic markers of the present invention. Preferred 
amplification primers are listed in SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668,
7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. It will be
appreciated that the primers listed are merely exemplary and that any other set of primers
which produce amplification products containing one or more biallelic markers of the present
invention may also be used.

The primers are selected to be substantially complementary to the different strands of 
each specific sequence to be amplified. The length of the primers of the present invention can
range from 8 to 100 nucleotides, preferably from 8 to 50, 8 to 30 or more preferably 8 to 25 
nucleotides. Shorter primers tend to lack specificity for a target nucleic acid sequence and 
generally require cooler temperatures to form sufficiently stable hybrid complexes with the 
template. Longer primers are expensive to produce and can sometimes self-hybridize to form 
hairpin structures. The formation of stable hybrids depends on the melting temperature (Tm)
of the DNA. The Tm depends on the length of the primer, the ionic strength of the solution 
and the G+C content. The higher the G+C content of the primer, the higher is the melting 
temperature because G:C pairs are held by three H bonds whereas A:T pairs have only two. 
The G+C content of the amplification primers of the present invention preferably ranges 
between 10 and 75%, more preferably between 35 and 60%, and most preferably between 40
and 55%. The appropriate length for primers under a particular set of assay conditions may be 
empirically determined by one of skill in the art. 
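
The primer criteria above can be checked programmatically. The sketch below computes length and G+C content and gives a rough Tm estimate using the Wallace rule of thumb, Tm = 2(A+T) + 4(G+C) in °C. That rule is an approximation for short oligonucleotides and is not taken from the text. The default limits are the most preferred ranges quoted above (8 to 25 nucleotides, 40 to 55% G+C); the function names are hypothetical.

```python
def gc_content(primer: str) -> float:
    """G+C content of a primer, in percent."""
    s = primer.upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C by the Wallace rule.

    A rule of thumb for short oligonucleotides (an assumption of this sketch);
    it ignores ionic strength, which also affects the Tm.
    """
    s = primer.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc


def check_primer(primer: str, min_len: int = 8, max_len: int = 25,
                 min_gc: float = 40.0, max_gc: float = 55.0) -> dict:
    """Summarise a candidate primer against the preferred ranges."""
    gc = gc_content(primer)
    return {
        "length": len(primer),
        "gc_percent": gc,
        "tm_estimate": wallace_tm(primer),
        "length_ok": min_len <= len(primer) <= max_len,
        "gc_ok": min_gc <= gc <= max_gc,
    }


if __name__ == "__main__":
    print(check_primer("AGCTGACCTAGGCATCGATC"))   # toy 20-mer
```

G:C-rich primers score a higher estimate than A:T-rich primers of the same length, reflecting the three versus two hydrogen bonds mentioned above.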

The spacing of the primers determines the length of the segment to be amplified. In 
the context of the present invention amplified segments carrying biallelic markers can range in 
size from at least about 25 bp to 35 kbp. Amplification fragments from 25-3000 bp are
typical, fragments from 50-1000 bp are preferred and fragments from 100-600 bp are highly
preferred. It will be appreciated that amplification primers for the biallelic markers may be
any sequence which allows the specific amplification of any DNA fragment carrying the
markers. Amplification primers may be labeled or immobilized on a solid support as
described in I.

III.C. Methods of Genotyping DNA samples for Biallelic Markers

Any method known in the art can be used to identify the nucleotide present at a 
biallelic marker site. Since the biallelic marker allele to be detected has been identified and 
specified in the present invention, detection will prove simple for one of ordinary skill in the 
art by employing any of a number of techniques. Many genotyping methods require the
previous amplification of the DNA region carrying the biallelic marker of interest. While the
amplification of target or signal is often preferred at present, ultrasensitive detection methods
which do not require amplification are also encompassed by the present genotyping methods.
Methods well-known to those skilled in the art that can be used to detect biallelic
polymorphisms include methods such as conventional dot blot analyses, single strand
conformational polymorphism analysis (SSCP) described by Orita et al. (Proc. Natl. Acad.
Sci. U.S.A. 86:2766-2770, 1989), denaturing gradient gel electrophoresis (DGGE),
heteroduplex analysis, mismatch cleavage detection, and other conventional techniques as
described in Sheffield, V.C. et al. (Proc. Natl. Acad. Sci. USA 49:699-706, 1991), White et al.
(Genomics 12:301-306, 1992), Grompe, M. et al. (Proc. Natl. Acad. Sci. USA 86:5855-5892,
1989) and Grompe, M. (Nature Genetics 5:111-117, 1993). Another method for determining
the identity of the nucleotide present at a particular polymorphic site employs a specialized
exonuclease-resistant nucleotide derivative as described in US patent 4,656,127.

Preferred methods involve directly determining the identity of the nucleotide present 
at a biallelic marker site by sequencing assay, enzyme-based mismatch detection assay, or 
hybridization assay. The following is a description of some preferred methods. A highly
preferred method is the microsequencing technique. The term "sequencing assay" is used 
herein to refer to polymerase extension of duplex primer/template complexes and includes 
both traditional sequencing and microsequencing. 

1) Sequencing assays 

The nucleotide present at a polymorphic site can be determined by sequencing
methods. In a preferred embodiment, DNA samples are subjected to PCR amplification
before sequencing as described above. DNA sequencing methods are described in II.C.
Preferably, the amplified DNA is subjected to automated dideoxy terminator sequencing
reactions using a dye-primer cycle sequencing protocol. Sequence analysis allows the
identification of the base present at the biallelic marker site.

2) Microsequencing assays 

In microsequencing methods, a nucleotide at the polymorphic site that is unique to one of 
the alleles in a target DNA is detected by a single nucleotide primer extension reaction. This 
method involves appropriate microsequencing primers which hybridize just upstream of a
polymorphic base of interest in the target nucleic acid. A polymerase is used to specifically
extend the 3' end of the primer with one single ddNTP (chain terminator) complementary to 
the selected nucleotide at the polymorphic site. Next the identity of the incorporated 
nucleotide is determined in any suitable way. 
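
The single nucleotide primer extension principle can be sketched as follows. To keep the example short, each allele and the primer are written 5'->3' on the strand being synthesised, so the incorporated ddNTP is simply the base that follows the primer; this representation, the function names and the toy sequences are assumptions for illustration.

```python
def incorporated_ddntp(allele_sequence: str, primer: str) -> str:
    """Base added to the 3' end of a microsequencing primer.

    Both arguments are written 5'->3' on the strand being synthesised; the
    primer must end immediately upstream of the polymorphic base.
    Only a perfectly matching primer is considered (idealised model).
    """
    start = allele_sequence.find(primer)
    if start < 0:
        raise ValueError("primer does not anneal to this sequence")
    site = start + len(primer)
    if site >= len(allele_sequence):
        raise ValueError("no base downstream of the primer")
    return allele_sequence[site]


def microsequencing_genotype(allele_1: str, allele_2: str, primer: str) -> str:
    """Genotype of a diploid sample from the ddNTPs incorporated on each copy."""
    bases = {incorporated_ddntp(allele_1, primer),
             incorporated_ddntp(allele_2, primer)}
    return "/".join(sorted(bases))


if __name__ == "__main__":
    copy_a = "ACGTTGCAAGCTAGG"   # toy allele with A at the polymorphic site
    copy_g = "ACGTTGCAGGCTAGG"   # toy allele with G at the polymorphic site
    primer = "ACGTTGCA"          # 3' end immediately upstream of the site
    print(microsequencing_genotype(copy_a, copy_g, primer))   # heterozygote
    print(microsequencing_genotype(copy_a, copy_a, primer))   # homozygote
```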

Typically, microsequencing reactions are carried out using fluorescent ddNTPs and the
extended microsequencing primers are analyzed by electrophoresis on ABI 377 sequencing
machines to determine the identity of the incorporated nucleotide as described in EP 412 583.
Alternatively, capillary electrophoresis can be used in order to process a higher number of
assays simultaneously. An example of a typical microsequencing procedure that can be used
in the context of the present invention is provided in Example 8.

Different approaches can be used to detect the nucleotide added to the
microsequencing primer. A homogeneous phase detection method based on fluorescence
resonance energy transfer has been described by Chen and Kwok (Nucleic Acids Research
25:347-353, 1997) and Chen et al. (Proc. Natl. Acad. Sci. USA 94/20:10756-10761, 1997). In
this method amplified genomic DNA fragments containing polymorphic sites are incubated
with a 5'-fluorescein-labeled primer in the presence of allelic dye-labeled
dideoxyribonucleoside triphosphates and a modified Taq polymerase. The dye-labeled primer
is extended one base by the dye-terminator specific for the allele present on the template. At
the end of the genotyping reaction, the fluorescence intensities of the two dyes in the reaction
mixture are analyzed directly without separation or purification. All these steps can be
performed in the same tube and the fluorescence changes can be monitored in real time.
Alternatively, the extended primer may be analyzed by MALDI-TOF Mass Spectrometry. The
base at the polymorphic site is identified by the mass added onto the microsequencing primer
(see Haff L.A. and Smirnov I.P., Genome Research, 7:378-388, 1997).

Microsequencing may be achieved by the established microsequencing method or by 
developments or derivatives thereof. Alternative methods include several solid-phase 
microsequencing techniques. The basic microsequencing protocol is the same as described
previously, except that the method is conducted as a heterogeneous phase assay, in which the
primer or the target molecule is immobilized or captured onto a solid support. To simplify the 
primer separation and the terminal nucleotide addition analysis, oligonucleotides are attached 
to solid supports or are modified in such ways that permit affinity separation as well as 
polymerase extension. The 5' ends and internal nucleotides of synthetic oligonucleotides can
be modified in a number of different ways to permit different affinity separation approaches, 
e.g., biotinylation. If a single affinity group is used on the oligonucleotides, the 
oligonucleotides can be separated from the incorporated terminator reagent. This eliminates
the need of physical or size separation. More than one oligonucleotide can be separated from
the terminator reagent and analyzed simultaneously if more than one affinity group is used.
This permits the analysis of several nucleic acid species or more nucleic acid sequence 
information per extension reaction. The affinity group need not be on the priming 
oligonucleotide but could alternatively be present on the template. For example, 
immobilization can be carried out via an interaction between biotinylated DNA and 
streptavidin-coated microtitration wells or avidin-coated polystyrene particles. In the same
manner oligonucleotides or templates may be attached to a solid support in a high-density
format. In such solid phase microsequencing reactions, incorporated ddNTPs can be
radiolabeled (Syvanen, Clinica Chimica Acta 226:225-236, 1994) or linked to fluorescein 
(Livak and Hainer, Human Mutation 3:379-385,1994). The detection of radiolabeled ddNTPs 
can be achieved through scintillation-based techniques. The detection of fluorescein-linked 
ddNTPs can be based on the binding of antifluorescein antibody conjugated with alkaline
phosphatase, followed by incubation with a chromogenic substrate (such as p-nitrophenyl
phosphate). Other possible reporter-detection pairs include: ddNTP linked to dinitrophenyl
(DNP) and anti-DNP alkaline phosphatase conjugate (Harju et al., Clin. Chem. 39/11:2282-2287,
1993) or biotinylated ddNTP and horseradish peroxidase-conjugated streptavidin with
o-phenylenediamine as a substrate (WO 92/15712). As yet another alternative solid-phase
microsequencing procedure, Nyren et al. (Analytical Biochemistry 208:171-175, 1993) 
described a method relying on the detection of DNA polymerase activity by an enzymatic 
luminometric inorganic pyrophosphate detection assay (ELIDA). 

Pastinen et al. (Genome research 7:606-614, 1997) describe a method for multiplex 
detection of single nucleotide polymorphism in which the solid phase minisequencing
principle is applied to an oligonucleotide array format. High-density arrays of DNA probes
attached to a solid support (DNA chips) are further described in III.C.5.

In one aspect the present invention provides polynucleotides and methods to genotype one
or more biallelic markers of the present invention by performing a microsequencing assay. In
the preferred embodiment, it will be appreciated that any primer having a 3' end immediately
adjacent to a polymorphic nucleotide may be used as a microsequencing primer. Similarly, it
will be appreciated that microsequencing analysis may be performed for any biallelic marker
or any combination of biallelic markers of the present invention. One aspect of the present
invention is a solid support which includes one or more microsequencing primers comprising
nucleotides complementary to the nucleotide sequences of SEQ ID Nos. 1 to 3908, 1 to 2260,
2261 to 3374, and 3735 to 3908 or the complements thereof, or fragments comprising at least
8, at least 12, at least 15, or at least 20 consecutive nucleotides thereof and having a 3'
terminus immediately upstream of the corresponding biallelic marker, for determining the
identity of a nucleotide at a biallelic marker site.

3) Mismatch detection assays based on polymerases and ligases

In one aspect the present invention provides polynucleotides and methods to
determine the allele of one or more biallelic markers of the present invention in a biological
sample, by mismatch detection assays based on polymerases and/or ligases. These assays are
based on the specificity of polymerases and ligases. Polymerization reactions place
particularly stringent requirements on correct base pairing of the 3' end of the amplification
primer and the joining of two oligonucleotides hybridized to a target DNA sequence is quite
sensitive to mismatches close to the ligation site, especially at the 3' end. The term "enzyme
based mismatch detection assay" is used herein to refer to any method of determining the
allele of a biallelic marker based on the specificity of ligases and polymerases. Preferred
methods are described below. Methods, primers and various parameters to amplify DNA
fragments comprising biallelic markers of the present invention are further described above in
III.B.

Allele specific amplification 

Discrimination between the two alleles of a biallelic marker can also be achieved by 
allele specific amplification, a selective strategy, whereby one of the alleles is amplified 
without amplification of the other allele. This is accomplished by placing a polymorphic base
at the 3' end of one of the amplification primers. Because the extension forms from the 3' end
of the primer, a mismatch at or near this position has an inhibitory effect on amplification.
Therefore, under appropriate amplification conditions, these primers only direct amplification
on their complementary allele. Designing the appropriate allele-specific primer and the
corresponding assay conditions are well within the ordinary skill in the art.
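
A minimal sketch of the allele-specific primer design described above: one primer is built per allele, each ending at its 3' terminus with the polymorphic base. The extension rule used here (a primer extends only on a perfectly matching template) is an idealisation of the inhibitory effect of a 3' mismatch, and all names and sequences are hypothetical.

```python
def allele_specific_primers(upstream_flank: str, alleles, length: int = 20) -> dict:
    """Return {allele: primer}, each primer ending 3' with the polymorphic base.

    upstream_flank: sequence immediately 5' of the polymorphic base, written
    on the same strand as the primers.
    """
    if length < 2 or len(upstream_flank) < length - 1:
        raise ValueError("flank too short for the requested primer length")
    stem = upstream_flank[-(length - 1):]        # shared part of both primers
    return {allele: stem + allele for allele in alleles}


def primer_extends(primer: str, template_same_strand: str) -> bool:
    """Idealised rule: extension occurs only on a perfect match, 3' base included."""
    return primer in template_same_strand


if __name__ == "__main__":
    flank = "GATTACAGATTACAGATTACA"              # toy upstream sequence
    primers = allele_specific_primers(flank, ("C", "T"), length=10)
    sample = flank + "C" + "GGA"                 # sample carrying the C allele
    for allele, primer in primers.items():
        print(allele, primer, "amplifies" if primer_extends(primer, sample) else "no product")
```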
Ligation/amplification based methods 

The "Oligonucleotide Ligation Assay" (OLA) uses two oligonucleotides which are 
designed to be capable of hybridizing to abutting sequences of a single strand of a target
molecule. One of the oligonucleotides is biotinylated, and the other is detectably labeled. If
the precise complementary sequence is found in a target molecule, the oligonucleotides will
hybridize such that their termini abut, and create a ligation substrate that can be captured and
detected. OLA is capable of detecting biallelic markers and may be advantageously combined
with PCR as described by Nickerson D.A. et al. (Proc. Natl. Acad. Sci. U.S.A. 87:8923-8927,
1990). In this method, PCR is used to achieve the exponential amplification of target DNA,
which is then detected using OLA.
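
The ligation requirement of OLA can be illustrated with an idealised model in which two oligonucleotides are joined only if, taken together, they are exactly complementary to a contiguous stretch of the target strand. The function names, the perfect-match rule and the toy sequences are assumptions made for this sketch.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def ola_ligates(target: str, oligo_5prime: str, oligo_3prime: str) -> bool:
    """True if the two oligonucleotides anneal to abutting target sequences.

    The ligated product reads oligo_5prime + oligo_3prime (5'->3'); in this
    idealised model ligation requires that product to be perfectly
    complementary to a contiguous segment of the single-stranded target.
    """
    product = oligo_5prime + oligo_3prime
    return reverse_complement(product) in target


if __name__ == "__main__":
    target = "TTGACCTAGGCATCGATCGGATTACAGG"       # toy target strand
    product = reverse_complement(target[8:20])    # perfectly matched product
    first, second = product[:6], product[6:]
    print(ola_ligates(target, first, second))     # matched oligonucleotides
    mismatched = first[:-1] + ("A" if first[-1] != "A" else "C")
    print(ola_ligates(target, mismatched, second))  # mismatch at the junction
```

In an actual assay one oligonucleotide would carry biotin and the other a detectable label, so that only the ligated product is both captured and detected.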

Other methods which are particularly suited for the detection of biallelic markers
include LCR (ligase chain reaction) and Gap LCR (GLCR), which are described above in III.B.
As mentioned above, LCR uses two pairs of probes to exponentially amplify a specific target.
The sequence of each pair of oligonucleotides is selected to permit the pair to hybridize to
abutting sequences of the same strand of the target. Such hybridization forms a substrate for a
template-dependent ligase. In accordance with the present invention, LCR can be performed
with oligonucleotides having the proximal and distal sequences of the same strand of a
biallelic marker site. In one embodiment, either oligonucleotide will be designed to include
the biallelic marker site. In such an embodiment, the reaction conditions are selected such that
the oligonucleotides can be ligated together only if the target molecule either contains or lacks
the specific nucleotide(s) that is complementary to the biallelic marker on the oligonucleotide.
In an alternative embodiment, the oligonucleotides will not include the biallelic marker, such
that when they hybridize to the target molecule, a "gap" is created as described in WO
90/01069. This gap is then "filled" with complementary dNTPs (as mediated by DNA
polymerase), or by an additional pair of oligonucleotides. Thus at the end of each cycle, each
single strand has a complement capable of serving as a target during the next cycle and
exponential allele-specific amplification of the desired sequence is obtained.

Ligase/Polymerase-mediated Genetic Bit Analysis™ is another method for 
determining the identity of a nucleotide at a preselected site in a nucleic acid molecule (WO 
95/21271). This method involves the incorporation of a nucleoside triphosphate that is 
complementary to the nucleotide present at the preselected site onto the terminus of a primer
molecule, and their subsequent ligation to a second oligonucleotide. The reaction is monitored 
by detecting a specific label attached to the reaction's solid phase or by detection in solution. 
4) Hybridization assay methods 

A preferred method of determining the identity of the nucleotide present at a biallelic 
marker site involves nucleic acid hybridization. The hybridization probes, which can be
conveniently used in such reactions, preferably include the probes defined herein. Any 
hybridization assay may be used including Southern hybridization, Northern hybridization, dot 
blot hybridization and solid-phase hybridization (see Sambrook et al., Molecular Cloning- A 
Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989). 

Hybridization refers to the formation of a duplex structure by two single stranded
nucleic acids due to complementary base pairing. Hybridization can occur between exactly
complementary nucleic acid strands or between nucleic acid strands that contain minor regions 
of mismatch. Specific probes can be designed that hybridize to one form of a biallelic marker 
and not to the other and therefore are able to discriminate between different allelic forms. 
Allele-specific probes are often used in pairs, one member of a pair showing perfect match to
a target sequence containing the original allele and the other showing a perfect match to the 
target sequence containing the alternative allele. Hybridization conditions should be 
sufficiently stringent that there is a significant difference in hybridization intensity between 
alleles, and preferably an essentially binary response, whereby a probe hybridizes to only one 
of the alleles. Stringent, sequence specific hybridization conditions, under which a probe will
hybridize only to the exactly complementary target sequence are well known in the art 
(Sambrook et al., Molecular Cloning - A Laboratory Manual, Second Edition, Cold Spring 
Harbor Press, N.Y., 1989). Stringent conditions are sequence dependent and will be different 
in different circumstances. Generally, stringent conditions are selected to be about 5°C lower 
than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and
pH. By way of example and not limitation, procedures using conditions of high stringency are
as follows: Prehybridization of filters containing DNA is carried out for 8 h to overnight at
65°C in buffer composed of 6X SSC, 50 mM Tris-HCl (pH 7.5), 1 mM EDTA, 0.02% PVP,
0.02% Ficoll, 0.02% BSA, and 500 µg/ml denatured salmon sperm DNA. Filters are
hybridized for 48 h at 65°C, the preferred hybridization temperature, in prehybridization
mixture containing 100 µg/ml denatured salmon sperm DNA and 5-20 x 10^6 cpm of
32P-labeled probe. Alternatively, the hybridization step can be performed at 65°C in the
presence of SSC buffer, 1 x SSC corresponding to 0.15 M NaCl and 0.05 M Na citrate.
Subsequently, filter washes can be done at 37°C for 1 h in a solution containing 2X SSC,
0.01% PVP, 0.01% Ficoll, and 0.01% BSA, followed by a wash in 0.1X SSC at 50°C for 45
min. Alternatively, filter washes can be performed in a solution containing 2 x SSC and 0.1%
SDS, or 0.5 x SSC and 0.1% SDS, or 0.1 x SSC and 0.1% SDS at 68°C for 15 minute
intervals. Following the wash steps, the hybridized probes are detectable by autoradiography.
By way of example and not limitation, procedures using conditions of intermediate stringency
are as follows: Filters containing DNA are prehybridized, and then hybridized at a temperature
of 60°C in the presence of a 5 x SSC buffer and labeled probe. Subsequently, filter washes
are performed in a solution containing 2 x SSC at 50°C and the hybridized probes are
detectable by autoradiography. Other conditions of high and intermediate stringency which
may be used are well known in the art and as cited in Sambrook et al. (Molecular Cloning - A
Laboratory Manual, Second Edition, Cold Spring Harbor Press, N.Y., 1989) and Ausubel et al.
(Current Protocols in Molecular Biology, Green Publishing Associates and Wiley
Interscience, N.Y., 1989).
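
The use of allele-specific probes in pairs, and the essentially binary response sought under stringent conditions, can be sketched as follows. Probes are represented by the target-strand site they detect, and a single mismatch is taken to abolish hybridization; both choices, like the names, the flank length and the sequences, are simplifying assumptions of this example.

```python
def allele_specific_probe_sites(upstream: str, alleles, downstream: str,
                                flank: int = 7) -> dict:
    """Return {allele: site} with the polymorphic base at the centre of each site.

    Each site is the target sequence a probe is designed to match perfectly
    (the probe itself would be its complement).
    """
    if flank < 1 or len(upstream) < flank or len(downstream) < flank:
        raise ValueError("flanking sequences are shorter than the requested flank")
    return {a: upstream[-flank:] + a + downstream[:flank] for a in alleles}


def count_mismatches(site_a: str, site_b: str) -> int:
    """Number of positions at which two equal-length sites differ."""
    if len(site_a) != len(site_b):
        raise ValueError("sites must have the same length")
    return sum(x != y for x, y in zip(site_a, site_b))


def binary_response(probe_sites: dict, sample_site: str) -> dict:
    """Idealised stringent hybridization: only a perfect match gives a signal."""
    return {allele: count_mismatches(site, sample_site) == 0
            for allele, site in probe_sites.items()}


if __name__ == "__main__":
    sites = allele_specific_probe_sites("ACGTACGTAC", ("A", "G"), "TTGCATGCAA")
    sample = sites["G"]                           # sample carrying the G allele
    print(binary_response(sites, sample))
```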

Although such hybridizations can be performed in solution, it is preferred to employ a 
solid-phase hybridization assay. The target DNA comprising a biallelic marker of the present 
invention may be amplified prior to the hybridization reaction. The presence of a specific 
allele in the sample is determined by detecting the presence or the absence of stable hybrid
duplexes formed between the probe and the target DNA. The detection of hybrid duplexes can
be carried out by a number of methods. Various detection assay formats are well known which 
utilize detectable labels bound to either the target or the probe to enable detection of the 
hybrid duplexes. Typically, hybridization duplexes are separated from unhybridized nucleic 
30 acids and the labels bound to the duplexes are then detected. Those skilled in the art will 
recognize that wash steps may be employed to wash away excess target DNA or probe. 
Standard heterogeneous assay formats are suitable for detecting the hybrids using the labels 
present on the primers and probes. 

Two recently developed assays allow hybridization-based allele discrimination with
no need for separations or washes (see Landegren U. et al., Genome Research, 8:769-776,
1998). The TaqMan assay takes advantage of the 5' nuclease activity of Taq DNA
polymerase to digest a DNA probe annealed specifically to the accumulating amplification
product. TaqMan probes are labeled with a donor-acceptor dye pair that interacts via
fluorescence energy transfer. Cleavage of the TaqMan probe by the advancing polymerase
during amplification dissociates the donor dye from the quenching acceptor dye, greatly
increasing the donor fluorescence. All reagents necessary to detect two allelic variants can be
assembled at the beginning of the reaction and the results are monitored in real time (see
Livak et al., Nature Genetics, 9:341-342, 1995). In an alternative homogeneous
hybridization-based procedure, molecular beacons are used for allele discriminations. Molecular
beacons are hairpin-shaped oligonucleotide probes that report the presence of specific nucleic
acids in homogeneous solutions. When they bind to their targets they undergo a conformational
reorganization that restores the fluorescence of an internally quenched fluorophore (Tyagi et
al., Nature Biotechnology, 16:49-53, 1998).

The polynucleotides provided herein can be used in hybridization assays for the
detection of biallelic marker alleles in biological samples. These probes are characterized in
that they preferably comprise between 8 and 50 nucleotides, and in that they are sufficiently
complementary to a sequence comprising a biallelic marker of the present invention to
hybridize thereto and preferably sufficiently specific to be able to discriminate between
target sequences differing by only one nucleotide. The GC content in the probes of the invention
usually ranges between 10 and 75 %, preferably between 35 and 60 %, and more preferably
between 40 and 55 %. The length of these probes can range from 10, 15, 20, or 30 to at least
100 nucleotides, preferably from 10 to 50, more preferably from 18 to 35 nucleotides. A
particularly preferred probe is 25 nucleotides in length. Preferably the biallelic marker is
within 4 nucleotides of the center of the polynucleotide probe. In particularly preferred
probes the biallelic marker is at the center of said polynucleotide. Shorter probes may lack
specificity for a target nucleic acid sequence and generally require cooler temperatures to form
sufficiently stable hybrid complexes with the template. Longer probes are expensive to
produce and can sometimes self-hybridize to form hairpin structures. Methods for the
synthesis of oligonucleotide probes have been described above and can be applied to the
probes of the present invention.
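
The preferred probe parameters stated above (length of 18 to 35 nucleotides, GC content of 35 to 60 %, marker within 4 nucleotides of the center) can be checked mechanically. The following Python sketch is illustrative only: the function names and the example sequence are hypothetical and are not taken from the sequence listing.

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a probe sequence."""
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)


def check_probe(seq: str, marker_index: int) -> dict:
    """Check a candidate probe against the preferred ranges given in the text.

    marker_index is the 0-based position of the polymorphic base in seq.
    """
    center = (len(seq) - 1) / 2
    return {
        "length_ok": 18 <= len(seq) <= 35,            # preferred length range
        "gc_ok": 35.0 <= gc_percent(seq) <= 60.0,     # preferred GC range
        "marker_near_center": abs(marker_index - center) <= 4,
        "marker_at_center": marker_index == center,
    }


# Hypothetical 25-nucleotide probe with the marker at its central base.
probe = "ATGCCGTAGCTAGGCTTACGATCGA"
report = check_probe(probe, marker_index=12)
```

A probe failing any of these checks is not excluded by the text, which states preferences rather than limits; the sketch only flags departures from the preferred ranges.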

Preferably the probes of the present invention are labeled or immobilized on a solid
support. Labels and solid supports are further described in I. Detection probes are generally
nucleic acid sequences or uncharged nucleic acid analogs such as, for example, peptide nucleic
acids, which are disclosed in International Patent Application WO 92/20702, and morpholino
analogs, which are described in U.S. Patents Numbered 5,185,444; 5,034,506 and 5,142,047.
The probe may have to be rendered "non-extendable" in that additional dNTPs cannot be
added to the probe. In and of themselves analogs usually are non-extendable and nucleic acid
probes can be rendered non-extendable by modifying the 3' end of the probe such that the
hydroxyl group is no longer capable of participating in elongation. For example, the 3' end of
the probe can be functionalized with the capture or detection label to thereby consume or
otherwise block the hydroxyl group. Alternatively, the 3' hydroxyl group simply can be
cleaved, replaced or modified. U.S. Patent Application Serial No. 07/049,061 filed April 19,
1993 describes modifications, which can be used to render a probe non-extendable.

The probes of the present invention are useful for a number of purposes. They can be
used in Southern hybridization to genomic DNA or Northern hybridization to mRNA. The
probes can also be used to detect PCR amplification products. By assaying the hybridization
to an allele specific probe, one can detect the presence or absence of a biallelic marker allele
in a given sample.

High-Throughput parallel hybridizations in array format are specifically encompassed 
within "hybridization assays" and are described below. 

Hybridization to addressable arrays of oligonucleotides

Hybridization assays based on oligonucleotide arrays rely on the differences in
hybridization stability of short oligonucleotides to perfectly matched and mismatched target
sequence variants. Efficient access to polymorphism information is obtained through a basic
structure comprising high-density arrays of oligonucleotide probes attached to a solid support
(the chip) at selected positions. Each DNA chip can contain thousands to millions of
individual synthetic DNA probes arranged in a grid-like pattern and miniaturized to the size of
a dime.

The chip technology has already been applied with success in numerous cases. For
example, the screening of mutations has been undertaken in the BRCA1 gene, in S. cerevisiae
mutant strains, and in the protease gene of HIV-1 virus (Hacia et al., Nature Genetics,
14(4):441-447, 1996; Shoemaker et al., Nature Genetics, 14(4):450-456, 1996; Kozal et al.,
Nature Medicine, 2:753-759, 1996). Chips of various formats for use in detecting biallelic
polymorphisms can be produced on a customized basis by Affymetrix (GeneChip™), Hyseq
(HyChip and HyGnostics), and Protogene Laboratories.

In general, these methods employ arrays of oligonucleotide probes that are
complementary to target nucleic acid sequence segments from an individual, which target
sequences include a polymorphic marker. EP785280 describes a tiling strategy for the
detection of single nucleotide polymorphisms. Briefly, arrays may generally be "tiled" for a
large number of specific polymorphisms. By "tiling" is generally meant the synthesis of a
defined set of oligonucleotide probes which is made up of a sequence complementary to the
target sequence of interest, as well as preselected variations of that sequence, e.g., substitution
of one or more given positions with one or more members of the basis set of monomers, i.e.
nucleotides. Tiling strategies are further described in PCT application No. WO 95/11995. In
a particular aspect, arrays are tiled for a number of specific, identified biallelic marker
sequences. In particular the array is tiled to include a number of detection blocks, each
detection block being specific for a specific biallelic marker or a set of biallelic markers. For
example, a detection block may be tiled to include a number of probes, which span the
sequence segment that includes a specific polymorphism. To ensure probes that are
complementary to each allele, the probes are synthesized in pairs differing at the biallelic
marker. In addition to the probes differing at the polymorphic base, monosubstituted probes
are also generally tiled within the detection block. These monosubstituted probes have bases
at and up to a certain number of bases in either direction from the polymorphism, substituted
with the remaining nucleotides (selected from A, T, G, C and U). Typically the probes in a
tiled detection block will include substitutions of the sequence positions up to and including
those that are 5 bases away from the biallelic marker. The monosubstituted probes provide
internal controls for the tiled array, to distinguish actual hybridization from artefactual
cross-hybridization. Upon completion of hybridization with the target sequence and washing of the
array, the array is scanned to determine the position on the array to which the target sequence
hybridizes. The hybridization data from the scanned array is then analyzed to identify which
allele or alleles of the biallelic marker are present in the sample. Hybridization and scanning
may be carried out as described in PCT application No. WO 92/10092 and WO 95/11995 and
US patent No. 5,424,186.
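
The composition of one tiled detection block can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the procedure of the cited applications: the function names, the probe labels and the example target are hypothetical, only DNA bases are used, and every probe is written as the reverse complement of the target variant it is meant to detect.

```python
BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_detection_block(target, marker_pos, alleles, window=5):
    """Build the probes of one detection block for one biallelic marker.

    For each allele the block holds one perfect-match probe plus
    monosubstituted control probes covering every position up to
    `window` bases on either side of the marker.
    """
    probes = []
    for allele in alleles:
        variant = target[:marker_pos] + allele + target[marker_pos + 1:]
        probes.append((f"allele_{allele}", reverse_complement(variant)))
        start = max(0, marker_pos - window)
        stop = min(len(variant), marker_pos + window + 1)
        for pos in range(start, stop):
            if pos == marker_pos:
                continue  # the marker itself is covered by the allele pair
            for base in BASES:
                if base != variant[pos]:
                    mono = variant[:pos] + base + variant[pos + 1:]
                    probes.append((f"allele_{allele}_pos{pos}{base}",
                                   reverse_complement(mono)))
    return probes


# Hypothetical 25-base target segment with an A/G marker at index 12.
target = "ATGCCGTAGCTAAGCTTACGATCGA"
block = tile_detection_block(target, marker_pos=12, alleles=("A", "G"))
```

With a window of 5 bases, each allele contributes 1 perfect-match probe and 30 monosubstituted probes (10 positions, 3 alternative bases each), so the block above holds 62 probes.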

Thus, in some embodiments, the chips may comprise an array of nucleic acid
sequences of fragments of about 15 nucleotides in length. In further embodiments, the chip
may comprise an array including at least one of the sequences selected from the group
consisting of SEQ ID No. 1 to 3908, 1 to 2260, 2261 to 3374, and 3735 to 3908 and the
sequences complementary thereto, or a fragment thereof at least about 8 consecutive
nucleotides, preferably 10, 15, 20, more preferably at least 30, 35, 43, 44, 45, 46 or 47
consecutive nucleotides, to the extent that a contiguous span of these lengths is consistent with
the lengths of the particular Sequence ID. In some embodiments, the chip may comprise an
array of at least 2, 3, 4, 5, 6, 7, 8 or more of these polynucleotides of the invention. Solid
supports and polynucleotides of the present invention attached to solid supports are further
described in I.

5) Integrated Systems

Another technique, which may be used to analyze polymorphisms, includes
multicomponent integrated systems, which miniaturize and compartmentalize processes such
as PCR and capillary electrophoresis reactions in a single functional device. An example of
such technique is disclosed in US patent 5,589,136, which describes the integration of PCR
amplification and capillary electrophoresis in chips.

Integrated systems can be envisaged mainly when microfluidic systems are used.
These systems comprise a pattern of microchannels designed onto a glass, silicon, quartz, or
plastic wafer included on a microchip. The movements of the samples are controlled by
electric, electroosmotic or hydrostatic forces applied across different areas of the microchip.
For genotyping biallelic markers, the microfluidic system may integrate nucleic acid
amplification, microsequencing, capillary electrophoresis and a detection method such as
laser-induced fluorescence detection.

IV. Methods Of Genetic Analysis Using The Biallelic Markers Of The Present Invention

Different methods are available for the genetic analysis of complex traits (see Lander
and Schork, Science, 265, 2037-2048, 1994). The search for disease-susceptibility genes is
conducted using two main methods: the linkage approach in which evidence is sought for
cosegregation between a locus and a putative trait locus using family studies, and the
association approach in which evidence is sought for a statistically significant association
between an allele and a trait or a trait causing allele (Khoury J. et al., Fundamentals of
Genetic Epidemiology, Oxford University Press, NY, 1993). In general, the biallelic markers
of the present invention find use in any method known in the art to demonstrate a statistically
significant correlation between a genotype and a phenotype. The biallelic markers may be
used in parametric and non-parametric linkage analysis methods. Preferably, the biallelic
markers of the present invention are used to identify genes associated with detectable traits
using association studies, an approach which does not require the use of affected families and
which permits the identification of genes associated with complex and sporadic traits.
The genetic analysis using the biallelic markers of the present invention may be
conducted on any scale. The whole set of biallelic markers of the present invention or any
subset of biallelic markers of the present invention may be used. In some embodiments a
subset of biallelic markers corresponding to one or several candidate genes may be used. In
other embodiments a subset of biallelic markers corresponding to candidate genes from a
particular disease pathway may be used. Alternatively, a subset of biallelic markers of the
present invention localised on a specific chromosome segment may be used. Further, any set
of genetic markers including a biallelic marker of the present invention may be used. A set of
biallelic polymorphisms that could be used as genetic markers in combination with the
biallelic markers of the present invention has been described in WO 98/20165. As mentioned
above, it should be noted that the biallelic markers of the present invention may be included in
any complete or partial genetic map of the human genome. These different uses are
specifically contemplated in the present invention and claims.

IV.A. Linkage analysis

Linkage analysis is based upon establishing a correlation between the transmission of
genetic markers and that of a specific trait throughout generations within a family. Thus, the
aim of linkage analysis is to detect marker loci that show cosegregation with a trait of interest
in pedigrees.

Parametric methods 

When data are available from successive generations there is the opportunity to study
the degree of linkage between pairs of loci. Estimates of the recombination fraction enable
loci to be ordered and placed onto a genetic map. With loci that are genetic markers, a genetic
map can be established, and then the strength of linkage between markers and traits can be
calculated and used to indicate the relative positions of markers and genes affecting those
traits (Weir, B.S., Genetic Data Analysis II: Methods for Discrete Population Genetic Data,
Sinauer Assoc., Inc., Sunderland, MA, USA, 1996). The classical method for linkage analysis
is the logarithm of odds (lod) score method (see Morton N.E., Am. J. Hum. Genet., 7:277-318,
1955; Ott J., Analysis of Human Genetic Linkage, Johns Hopkins University Press, Baltimore,
1991). Calculation of lod scores requires specification of the mode of inheritance for the
disease (parametric method). Generally, the length of the candidate region identified using
linkage analysis is between 2 and 20Mb. Once a candidate region is identified as described
above, analysis of recombinant individuals using additional markers allows further delineation
of the candidate region. Linkage analysis studies have generally relied on the use of a
maximum of 5,000 microsatellite markers, thus limiting the maximum theoretical attainable
resolution of linkage analysis to about 600 kb on average.
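
The figure of about 600 kb follows from dividing the genome length by the number of markers. The short Python sketch below reproduces the arithmetic; the genome size of 3,000 Mb is an assumption made for the illustration and is not stated in the text, and the function name is hypothetical.

```python
# Assumed haploid human genome size (not stated in the text): about 3,000 Mb.
GENOME_SIZE_BP = 3_000_000_000


def average_marker_spacing_kb(n_markers: int,
                              genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Average distance between evenly spaced markers, in kilobases."""
    return genome_size_bp / n_markers / 1_000


# 5,000 microsatellite markers spread evenly over the genome.
spacing_kb = average_marker_spacing_kb(5_000)
```

Under this assumption the average spacing is 600 kb, which matches the resolution limit quoted above.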

Linkage analysis has been successfully applied to map simple genetic traits that show
clear Mendelian inheritance patterns and which have a high penetrance (i.e., the ratio between
the number of trait positive carriers of allele a and the total number of a carriers in the
population). However, parametric linkage analysis suffers from a variety of drawbacks.
First, it is limited by its reliance on the choice of a genetic model suitable for each studied
trait. Furthermore, as already mentioned, the resolution attainable using linkage analysis is
limited, and complementary studies are required to refine the analysis of the typical 2Mb to
20Mb regions initially identified through linkage analysis. In addition, parametric linkage
analysis approaches have proven difficult when applied to complex genetic traits, such as
those due to the combined action of multiple genes and/or environmental factors. It is very
difficult to model these factors adequately in a lod score analysis. In such cases, too large an
effort and cost are needed to recruit the adequate number of affected families required for
applying linkage analysis to these situations, as recently discussed by Risch, N. and
Merikangas, K. (Science, 273:1516-1517, 1996).

Non-parametric methods

The advantage of the so-called non-parametric methods for linkage analysis is that
they do not require specification of the mode of inheritance for the disease; they therefore
tend to be more useful for the analysis of complex traits. In non-parametric methods, one tries
to prove that the inheritance pattern of a chromosomal region is not consistent with random
Mendelian segregation by showing that affected relatives inherit identical copies of the region
more often than expected by chance. Affected relatives should show excess "allele sharing" even
in the presence of incomplete penetrance and polygenic inheritance. In non-parametric linkage
analysis the degree of agreement at a marker locus in two individuals can be measured either
by the number of alleles identical by state (IBS) or by the number of alleles identical by
descent (IBD). Affected sib pair analysis is a well-known special case and is the simplest
form of these methods.
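
Allele sharing by state can be counted directly from the genotypes of two relatives. The Python sketch below is a minimal illustration with hypothetical names and invented genotypes; it measures IBS sharing only, since IBD sharing additionally requires parental or pedigree information.

```python
from collections import Counter


def ibs_count(genotype_1, genotype_2) -> int:
    """Number of alleles (0, 1 or 2) shared identical by state at one locus."""
    shared = Counter(genotype_1) & Counter(genotype_2)  # multiset intersection
    return sum(shared.values())


def mean_ibs(sib_pairs) -> float:
    """Average IBS sharing over a list of (genotype, genotype) sib pairs."""
    return sum(ibs_count(g1, g2) for g1, g2 in sib_pairs) / len(sib_pairs)


# Hypothetical affected sib pairs genotyped at one biallelic marker.
sib_pairs = [
    (("A", "G"), ("A", "G")),  # 2 alleles shared
    (("A", "A"), ("A", "G")),  # 1 allele shared
    (("A", "A"), ("G", "G")),  # 0 alleles shared
    (("G", "G"), ("G", "G")),  # 2 alleles shared
]
average_sharing = mean_ibs(sib_pairs)
```

In an actual analysis the observed sharing would be compared with the sharing expected by chance, which depends on the allele frequencies at the marker.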

The biallelic markers of the present invention may be used in both parametric and
non-parametric linkage analysis. Preferably biallelic markers may be used in non-parametric
methods which allow the mapping of genes involved in complex traits. The biallelic markers
of the present invention may be used in both IBD- and IBS- methods to map genes affecting a
complex trait. In such studies, taking advantage of the high density of biallelic markers,
several adjacent biallelic marker loci may be pooled to achieve the efficiency attained by
multi-allelic markers (Zhao et al., Am. J. Hum. Genet., 63:225-240, 1998).

However, because both parametric and non-parametric linkage analysis methods analyse
affected relatives, they tend to be of limited value in the genetic analysis of drug responses or
in the analysis of side effects to treatments. This type of analysis is impractical in such cases
due to the lack of availability of familial cases. In fact, the likelihood of having more than one
individual in a family being exposed to the same drug at the same time is extremely low.

IV.B. Population Association studies

The present invention comprises methods for identifying one or several genes among a
set of candidate genes that are associated with a detectable trait using the biallelic markers of
the present invention. In one embodiment the present invention comprises methods to detect
an association between a biallelic marker allele or a biallelic marker haplotype and a trait.
Further, the invention comprises methods to identify a trait causing allele in linkage
disequilibrium with any biallelic marker allele of the present invention.

As described above, alternative approaches can be employed to perform association
studies: genome-wide association studies, candidate region association studies and candidate
gene association studies. In a preferred embodiment, the biallelic markers of the present
invention are used to perform candidate gene association studies. Further, the biallelic
markers of the present invention may be incorporated in any map of genetic markers of the
human genome in order to perform genome-wide association studies. Methods to generate a
high-density map of biallelic markers have been described in US Provisional Patent application
serial number 60/082,614. The biallelic markers of the present invention may further be
incorporated in any map of a specific candidate region of the genome (a specific chromosome
or a specific chromosomal segment for example).

As mentioned above, association studies may be conducted within the general
population and are not limited to studies performed on related individuals in affected families.
Association studies are extremely valuable as they permit the analysis of sporadic or
multifactor traits. Moreover, association studies represent a powerful method for fine-scale
mapping enabling much finer mapping of trait causing alleles than linkage studies. Studies
based on pedigrees often only narrow the location of the trait causing allele. Association
studies using the biallelic markers of the present invention can therefore be used to refine the
location of a trait causing allele in a candidate region identified by linkage analysis methods.
Moreover, once a chromosome segment of interest has been identified, the presence of a
candidate gene, such as a candidate gene of the present invention, in the region of interest can
provide a shortcut to the identification of the trait causing allele. Biallelic markers of the
present invention can be used to demonstrate that a candidate gene is associated with a trait.
Such uses are specifically contemplated in the present invention and claims.

1) Determining the frequency of a biallelic marker allele or of a biallelic marker
haplotype in a population

Association studies explore the relationships among frequencies for sets of alleles
between loci.

Determining the frequency of an allele in a population 

Allelic frequencies of the biallelic markers in a population can be determined using
one of the methods described above under the heading "Methods for genotyping an individual
for biallelic markers", or any genotyping procedure suitable for this intended purpose.
Genotyping pooled samples or individual samples can determine the frequency of a biallelic
marker allele in a population. One way to reduce the number of genotypings required is to use
pooled samples. A major obstacle in using pooled samples is the accuracy and
reproducibility with which DNA concentrations can be determined when setting up the pools.
Genotyping individual samples provides higher sensitivity, reproducibility and accuracy and
is the preferred method used in the present invention. Preferably, each individual is genotyped
separately and simple gene counting is applied to determine the frequency of an allele of a
biallelic marker or of a genotype in a given population.
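
Simple gene counting amounts to tallying the two alleles carried by each genotyped individual and dividing by the total number of chromosomes. The following Python sketch illustrates this with hypothetical function names and invented genotypes.

```python
from collections import Counter


def allele_frequencies(genotypes) -> dict:
    """Allele frequencies by gene counting over individually genotyped samples."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())  # two alleles per diploid individual
    return {allele: n / total for allele, n in counts.items()}


def genotype_frequencies(genotypes) -> dict:
    """Frequencies of unordered genotypes, e.g. ('A', 'G')."""
    counts = Counter(tuple(sorted(genotype)) for genotype in genotypes)
    return {g: n / len(genotypes) for g, n in counts.items()}


# Hypothetical sample of four individuals typed at one A/G biallelic marker.
sample = [("A", "A"), ("A", "A"), ("A", "G"), ("G", "G")]
allele_freqs = allele_frequencies(sample)
genotype_freqs = genotype_frequencies(sample)
```

For this sample, allele A is counted 5 times out of 8 chromosomes (frequency 0.625) and allele G 3 times (frequency 0.375).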

Determining the frequency of a haplotype in a population

The gametic phase of haplotypes is unknown when diploid individuals are
heterozygous at more than one locus. Using genealogical information in families, gametic
phase can sometimes be inferred (Perlin et al., Am. J. Hum. Genet., 55:777-787, 1994). When
no genealogical information is available different strategies may be used. One possibility is
that the multiple-site heterozygous diploids can be eliminated from the analysis, keeping only
the homozygotes and the single-site heterozygote individuals, but this approach might lead to
a possible bias in the sample composition and the underestimation of low-frequency
haplotypes. Another possibility is that single chromosomes can be studied independently, for
example, by asymmetric PCR amplification (see Newton et al., Nucleic Acids Res., 17:2503-2516,
1989; Wu et al., Proc. Natl. Acad. Sci. USA, 86:2757, 1989) or by isolation of single
chromosome by limit dilution followed by PCR amplification (see Ruano et al., Proc. Natl.
Acad. Sci. USA, 87:6296-6300, 1990). Further, a sample may be haplotyped for sufficiently
close biallelic markers by double PCR amplification of specific alleles (Sarkar, G. and
Sommer S.S., Biotechniques, 1991). These approaches are not entirely satisfying either
because of their technical complexity, the additional cost they entail, their lack of
generalisation at a large scale, or the possible biases they introduce. To overcome these
difficulties, an algorithm to infer the phase of PCR-amplified DNA genotypes introduced by
Clark A.G. (Mol. Biol. Evol., 7:111-122, 1990) may be used. Briefly, the principle is to start
filling a preliminary list of haplotypes present in the sample by examining unambiguous
individuals, that is, the complete homozygotes and the single-site heterozygotes. Then other
individuals in the same sample are screened for the possible occurrence of previously
recognised haplotypes. For each positive identification, the complementary haplotype is added
to the list of recognised haplotypes, until the phase information for all individuals is either
resolved or identified as unresolved. This method assigns a single haplotype to each
multiheterozygous individual, whereas several haplotypes are possible when there is more
than one heterozygous site. Alternatively, one can use methods estimating haplotype
frequencies in a population without assigning haplotypes to each individual. Preferably, a
method based on an expectation-maximization (EM) algorithm (Dempster et al., J. R. Stat.
Soc., 39B:1-38, 1977) leading to maximum-likelihood estimates of haplotype frequencies
under the assumption of Hardy-Weinberg proportions (random mating) is used (see Excoffier
L. and Slatkin M., Mol. Biol. Evol., 12(5):921-927, 1995). The EM algorithm is a generalised
iterative maximum-likelihood approach to estimation that is useful when data are ambiguous
and/or incomplete. The EM algorithm is used to resolve heterozygotes into haplotypes.
Haplotype estimations are further described below under the heading "Statistical methods".
Any other method known in the art to determine or to estimate the frequency of a haplotype in
a population may also be used.
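
The EM approach to haplotype frequency estimation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of the cited references: function names and the example genotypes are hypothetical, Hardy-Weinberg proportions are assumed, all haplotypes start at equal frequency, and a fixed number of iterations replaces a convergence test.

```python
from collections import defaultdict


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased genotype.

    genotype is a tuple of (allele, allele) pairs, one pair per locus.
    """
    first_a, first_b = genotype[0]
    pairs = {((first_a,), (first_b,))}
    for a, b in genotype[1:]:
        extended = set()
        for h1, h2 in pairs:
            extended.add((h1 + (a,), h2 + (b,)))
            extended.add((h1 + (b,), h2 + (a,)))
        pairs = extended
    return sorted({tuple(sorted(pair)) for pair in pairs})


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Maximum-likelihood haplotype frequencies under Hardy-Weinberg proportions."""
    expansions = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in expansions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    n_chromosomes = 2 * len(genotypes)
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in expansions:
            # E-step: probability of each compatible pair given current frequencies.
            weights = [(1 if h1 == h2 else 2) * freq[h1] * freq[h2]
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), weight in zip(pairs, weights):
                counts[h1] += weight / total
                counts[h2] += weight / total
        # M-step: new frequencies are the expected haplotype counts.
        freq = {h: counts[h] / n_chromosomes for h in haplotypes}
    return freq


# Hypothetical sample typed at two biallelic markers (A/G and C/T).
genotypes = (
    [(("A", "A"), ("C", "C"))] * 3      # unambiguous: haplotype A-C twice
    + [(("G", "G"), ("T", "T"))]        # unambiguous: haplotype G-T twice
    + [(("A", "G"), ("C", "T"))]        # double heterozygote: phase unknown
)
hap_freqs = em_haplotype_frequencies(genotypes)
```

In this example the double heterozygote is resolved almost entirely as A-C/G-T, because both haplotypes are supported by unambiguous individuals, and the estimates approach 0.7 for A-C and 0.3 for G-T.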

2) Linkage Disequilibrium analysis

Linkage disequilibrium is the non-random association of alleles at two or more loci
and represents a powerful tool for mapping genes involved in disease traits (see Ajioka R.S. et
al., Am. J. Hum. Genet., 60:1439-1447, 1997). Biallelic markers, because they are densely
spaced in the human genome and can be genotyped in greater numbers than other
types of genetic markers (such as RFLP or VNTR markers), are particularly useful in genetic
analysis based on linkage disequilibrium. The biallelic markers of the present invention may
be used in any linkage disequilibrium analysis method known in the art.

Briefly, when a disease mutation is first introduced into a population (by a new
mutation or the immigration of a mutation carrier), it necessarily resides on a single
chromosome and thus on a single "background" or "ancestral" haplotype of linked markers.
Consequently, there is complete disequilibrium between these markers and the disease
mutation: one finds the disease mutation only in the presence of a specific set of marker
alleles. Through subsequent generations recombinations occur between the disease mutation
and these marker polymorphisms, and the disequilibrium gradually dissipates. The pace of
this dissipation is a function of the recombination frequency, so the markers closest to the
disease gene will manifest higher levels of disequilibrium than those that are further away.
When not broken up by recombination, "ancestral" haplotypes and linkage disequilibrium
between marker alleles at different loci can be tracked not only through pedigrees but also
through populations. Linkage disequilibrium is usually seen as an association between one
specific allele at one locus and another specific allele at a second locus.
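
The dependence of this dissipation on recombination frequency can be illustrated with the standard population-genetic relation D_t = D_0 (1 - theta)^t, where theta is the recombination fraction between the two loci and t the number of generations. This relation is not stated in the text; it is used here as an assumption, for a large randomly mating population, and the function name and numerical values are hypothetical.

```python
def ld_after_generations(d0: float, theta: float, generations: int) -> float:
    """Expected disequilibrium after a number of generations of random mating.

    Uses D_t = D_0 * (1 - theta) ** t, with theta the recombination
    fraction between the marker and the disease locus.
    """
    return d0 * (1.0 - theta) ** generations


# Hypothetical comparison of a close marker and a more distant marker,
# both starting from the same initial disequilibrium.
initial_d = 0.25
close_marker = ld_after_generations(initial_d, theta=0.001, generations=100)
distant_marker = ld_after_generations(initial_d, theta=0.01, generations=100)
```

After 100 generations the marker at theta = 0.001 retains roughly 90 % of its initial disequilibrium, while the marker at theta = 0.01 retains roughly 37 %, consistent with the statement that markers closest to the disease gene show the highest disequilibrium.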

The pattern or curve of disequilibrium between disease and marker loci is expected to
exhibit a maximum that occurs at the disease locus. Consequently, the amount of linkage
disequilibrium between a disease allele and closely linked genetic markers may yield valuable
information regarding the location of the disease gene. For fine-scale mapping of a disease
locus, it is useful to have some knowledge of the patterns of linkage disequilibrium that exist
between markers in the studied region. As mentioned above the mapping resolution achieved
through the analysis of linkage disequilibrium is much higher than that of linkage studies. The
high density of biallelic markers combined with linkage disequilibrium analysis provides
powerful tools for fine-scale mapping. Different methods to calculate linkage disequilibrium
are described below under the heading "Statistical Methods".

3) Population-based case-control studies of trait-marker associations 

As mentioned above, the occurrence of pairs of specific alleles at different loci on the
same chromosome is not random and the deviation from random is called linkage
disequilibrium. Association studies focus on population frequencies and rely on the
phenomenon of linkage disequilibrium. If a specific allele in a given gene is directly involved
in causing a particular trait, its frequency will be statistically increased in an affected (trait
positive) population, when compared to the frequency in a trait negative population or in a
random control population. As a consequence of the existence of linkage disequilibrium, the
frequency of all other alleles present in the haplotype carrying the trait-causing allele will also
be increased in trait positive individuals compared to trait negative individuals or random
controls. Therefore, association between the trait and any allele (specifically a biallelic
marker allele) in linkage disequilibrium with the trait-causing allele will suffice to suggest the
presence of a trait-related gene in that particular region. Case-control populations can be
genotyped for biallelic markers to identify associations that narrowly locate a trait causing
allele. Any marker in linkage disequilibrium with a given marker associated with a trait
will also be associated with the trait. Linkage disequilibrium therefore allows the relative
frequencies in case-control populations of a limited number of genetic polymorphisms
(specifically biallelic markers) to be analysed as an alternative to screening all possible
functional polymorphisms in order to find trait-causing alleles. Association studies compare the
frequency of marker alleles in unrelated case-control populations, and represent powerful tools
for the dissection of complex traits.

Case-control populations (inclusion criteria) 

Population-based association studies do not concern familial inheritance but compare
the prevalence of a particular genetic marker, or a set of markers, in case-control populations.
They are case-control studies based on comparison of unrelated case (affected or trait
positive) individuals and unrelated control (unaffected or trait negative or random)
individuals. Preferably the control group is composed of unaffected or trait negative
individuals. Further, the control group is ethnically matched to the case population.
Moreover, the control group is preferably matched to the case-population for the main known
confounding factor for the trait under study (for example age-matched for an age-dependent
trait). Ideally, individuals in the two samples are paired in such a way that they are expected
to differ only in their disease status. In the following "trait positive population", "case
population" and "affected population" are used interchangeably.

An important step in the dissection of complex traits using association studies is the
choice of case-control populations (see Lander and Schork, Science, 265, 2037-2048, 1994).
A major step in the choice of case-control populations is the clinical definition of a given trait
or phenotype. Any genetic trait may be analysed by the association method proposed here by
carefully selecting the individuals to be included in the trait positive and trait negative
phenotypic groups. Four criteria are often useful: clinical phenotype, age at onset, family
history and severity. The selection procedure for continuous or quantitative traits (such as
blood pressure for example) involves selecting individuals at opposite ends of the phenotype
distribution of the trait under study, so as to include in these trait positive and trait negative
populations individuals with non-overlapping phenotypes. Preferably, case-control
populations consist of phenotypically homogeneous populations. Trait positive and trait
negative populations consist of phenotypically uniform populations of individuals,
each representing between 1 and 98%, preferably between 1 and 80%, more preferably
between 1 and 50%, still more preferably between 1 and 30%, and most preferably between 1 and
20% of the total population under study, and selected among individuals exhibiting
non-overlapping phenotypes. The clearer the difference between the two trait phenotypes, the
greater the probability of detecting an association with biallelic markers. The selection of
those drastically different but relatively uniform phenotypes enables efficient comparisons in
association studies and the possible detection of marked differences at the genetic level,
provided that the sample sizes of the populations under study are large enough.
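
The selection of individuals from opposite ends of a quantitative phenotype distribution can be sketched as follows. This is only an illustration: the function name `select_extremes`, the use of one measurement per individual, and the 20% default tail size are assumptions made for the example, not part of the method as described.

```python
def select_extremes(values, fraction=0.2):
    """Return (trait_negative, trait_positive) lists of individual indices.

    `values` holds one quantitative measurement per individual (for example
    blood pressure). `fraction` is the share of the total sample placed in
    each group; the 20% default is a hypothetical choice.
    """
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be in (0, 0.5] so the tails do not overlap")
    # Indices of individuals ordered from lowest to highest phenotype value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    k = int(len(values) * fraction)
    # Lower tail -> trait negative group, upper tail -> trait positive group.
    return order[:k], order[len(order) - k:]


blood_pressure = [120, 95, 180, 110, 150, 100, 170, 130, 105, 160]
low_group, high_group = select_extremes(blood_pressure, fraction=0.2)
```

Individuals in the middle of the distribution are simply left out, which is what keeps the two phenotypic groups non-overlapping.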

In preferred embodiments, a first group of between 50 and 300 trait positive 
individuals, preferably about 100 individuals, are recruited according to their phenotypes. A 
similar number of trait negative individuals are included in such studies. 

Association analysis

The general strategy to perform association studies using biallelic markers derived
from a region carrying a candidate gene is to scan two groups of individuals (case-control
populations) in order to measure and statistically compare the allele frequencies of the
biallelic markers of the present invention in both groups.
If a statistically significant association with a trait is identified for at least one
of the analysed biallelic markers, one can assume that either the associated allele is directly
responsible for causing the trait (the associated allele is the trait causing allele) or, more likely,
the associated allele is in linkage disequilibrium with the trait causing allele. The specific
characteristics of the associated allele with respect to the candidate gene function usually
give further insight into the relationship between the associated allele and the trait (causal or
in linkage disequilibrium). If the evidence indicates that the associated allele within the
candidate gene is most probably not the trait causing allele but is in linkage disequilibrium
with the real trait causing allele, then the trait causing allele can be found by sequencing the
vicinity of the associated marker.
Association studies are usually run in two successive steps. In a first phase, the
frequencies of a reduced number of biallelic markers from one or several candidate genes are
determined in the trait positive and trait negative populations. In a second phase of the
analysis, the identity of the candidate gene and the position of the genetic loci responsible for
the given trait are further refined using a higher density of markers from the relevant region.
However, if the candidate gene under study is relatively small in length, as is the case for
many of the candidate genes analysed in the present invention, a single phase may be
sufficient to establish significant associations.
Haplotype analysis 

As described above, when a chromosome carrying a disease allele first appears in a
population as a result of either mutation or migration, the mutant allele necessarily resides on
a chromosome having a set of linked markers: the ancestral haplotype. This haplotype can be
tracked through populations and its statistical association with a given trait can be analysed.
Complementing single point (allelic) association studies with multi-point association studies,
also called haplotype studies, increases the statistical power of association studies. Thus, a
haplotype association study allows one to define the frequency and the type of the ancestral
carrier haplotype. A haplotype analysis is important in that it increases the statistical power of
an analysis involving individual markers.

In a first stage of a haplotype frequency analysis, the frequency of the possible
haplotypes based on various combinations of the identified biallelic markers of the invention
is determined. The haplotype frequency is then compared for distinct populations of trait
positive and control individuals. The number of trait positive individuals which should be
subjected to this analysis to obtain statistically significant results usually ranges between 30
and 300, with a preferred number of individuals ranging between 50 and 150. The same
considerations apply to the number of unaffected individuals (or random controls) used in the
study. The results of this first analysis provide haplotype frequencies in case-control
populations; for each evaluated haplotype frequency, a p-value and an odds ratio are calculated.
If a statistically significant association is found, the relative risk for an individual carrying the
given haplotype of being affected with the trait under study can be approximated.
Interaction Analysis 

The biallelic markers of the present invention may also be used to identify patterns of
biallelic markers associated with detectable traits resulting from polygenic interactions. The
analysis of genetic interaction between alleles at unlinked loci requires individual genotyping
using the techniques described herein. The analysis of allelic interaction among a selected set
of biallelic markers with an appropriate level of statistical significance can be considered as a
haplotype analysis. Interaction analysis consists of stratifying the case-control populations
with respect to a given haplotype for the first loci and performing a haplotype analysis with
the second loci within each subpopulation.

Statistical methods used in association studies are further described below in IV.C. 
4) Testing for linkage in the presence of association 

The biallelic markers of the present invention may further be used in the TDT
(transmission/disequilibrium test). The TDT tests for both linkage and association and is not
affected by population stratification. The TDT requires data for affected individuals and their
parents, or data from unaffected sibs instead of from parents (see Spielmann S. et al., Am. J.
Hum. Genet., 52:506-516, 1993; Schaid D.J. et al., Genet. Epidemiol., 13:423-450, 1996;
Spielmann S. and Ewens W.J., Am. J. Hum. Genet., 62:450-458, 1998). Such combined tests
generally reduce the false-positive errors produced by separate analyses.
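
By way of illustration, the TDT for a single biallelic marker can be sketched in Python. The statistic used here, (b - c)^2 / (b + c) with one degree of freedom, is the McNemar-type form commonly given for the parent-based TDT; it is supplied as an assumption, since the text does not spell it out. The function name and the input counts are hypothetical.

```python
import math


def tdt_statistic(transmitted, untransmitted):
    """Parent-based TDT for one allele.

    `transmitted` (b) and `untransmitted` (c) are the numbers of times
    heterozygous parents did, or did not, pass the tested allele to an
    affected child. Returns (chi-square with 1 df, p-value).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


chi2, p_value = tdt_statistic(transmitted=60, untransmitted=40)
```

Only transmissions from heterozygous parents enter the count, which is why the test is insensitive to population stratification.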
IV.C. Statistical methods

In general, any method known in the art to test whether a trait and a genotype show a 
statistically significant correlation may be used. 

1) Methods in linkage analysis 

Statistical methods and computer programs useful for linkage analysis are well-known
to those skilled in the art (see Terwilliger J.D. and Ott J., Handbook of Human Genetic
Linkage, Johns Hopkins University Press, London, 1994; Ott J., Analysis of Human Genetic
Linkage, Johns Hopkins University Press, Baltimore, 1991).

2) Methods to estimate haplotype frequencies in a population 

As described above, when genotypes are scored, it is often not possible to distinguish
heterozygotes so that haplotype frequencies cannot be easily inferred. When the gametic phase
is not known, haplotype frequencies can be estimated from the multilocus genotypic data.
Any method known to a person skilled in the art can be used to estimate haplotype frequencies
(see Lange K., Mathematical and Statistical Methods for Genetic Analysis, Springer, New
York, 1997; Weir, B.S., Genetic Data Analysis II: Methods for Discrete Population Genetic
Data, Sinauer Assoc., Inc., Sunderland, MA, USA, 1996). Preferably, maximum-likelihood
haplotype frequencies are computed using an Expectation-Maximization (EM) algorithm (see
Dempster et al., J. R. Stat. Soc., 39B:1-38, 1977; Excoffier L. and Slatkin M., Mol. Biol. Evol.,
12(5):921-927, 1995). This procedure is an iterative process aiming at obtaining
maximum-likelihood estimates of haplotype frequencies from multi-locus genotype data when the
gametic phase is unknown. Haplotype estimations are usually performed by applying the EM
algorithm using for example the EM-HAPLO program (Hawley M.E. et al., Am. J. Phys.
Anthropol., 18:104, 1994) or the Arlequin program (Schneider et al., Arlequin: a software for
population genetics data analysis, University of Geneva, 1997). The EM algorithm is a
generalised iterative maximum likelihood approach to estimation and is briefly described
below.

In the following part of this text, phenotypes will refer to multi-locus genotypes with 
unknown phase. Genotypes will refer to known-phase multi-locus genotypes. 

Suppose a sample of N unrelated individuals is typed for K markers. The data observed
are the unknown-phase K-locus phenotypes, which can be categorised into F different phenotypes.
Suppose that we have H underlying possible haplotypes (in the case of K biallelic markers,
$H = 2^K$).
For phenotype j, suppose that $c_j$ genotypes are possible. We thus have the following equation:

$P_j = \sum_{i=1}^{c_j} \mathrm{pr}(\mathrm{genotype}_i) = \sum_{i=1}^{c_j} \mathrm{pr}(h_k, h_l)$    Equation 1

where $P_j$ is the probability of the phenotype j, and $h_k$ and $h_l$ are the two haplotypes constituting
genotype i. Under the Hardy-Weinberg equilibrium, $\mathrm{pr}(h_k, h_l)$ becomes:

$\mathrm{pr}(h_k, h_l) = \mathrm{pr}(h_k)^2$ if $h_k = h_l$, and $\mathrm{pr}(h_k, h_l) = 2\,\mathrm{pr}(h_k)\,\mathrm{pr}(h_l)$ if $h_k \neq h_l$.    Equation 2

The successive steps of the EM algorithm can be described as follows:

Starting with initial values of the haplotype frequencies, noted $p_1^{(0)}, p_2^{(0)}, \ldots, p_H^{(0)}$, these
initial values serve to estimate the genotype frequencies (Expectation step) and then to estimate
another set of haplotype frequencies (Maximisation step), noted $p_1^{(1)}, p_2^{(1)}, \ldots, p_H^{(1)}$. These two
steps are iterated until changes in the sets of haplotype frequencies are very small.

A stop criterion can be that the maximum difference between haplotype frequencies
between two iterations is less than $10^{-7}$. This value can be adjusted according to the desired
precision of estimations.

In detail, at a given iteration s, the Expectation step consists in calculating the
genotype frequencies by the following equation:

$\mathrm{pr}(\mathrm{genotype}_i)^{(s)} = \mathrm{pr}(\mathrm{phenotype}_j) \cdot \mathrm{pr}(\mathrm{genotype}_i \mid \mathrm{phenotype}_j)^{(s)} = \frac{n_j}{N} \cdot \frac{\mathrm{pr}(h_k, h_l)^{(s)}}{P_j^{(s)}}$    Equation 3

where genotype i occurs in phenotype j, $n_j$ is the number of individuals observed with
phenotype j, and $h_k$ and $h_l$ constitute genotype i. Each
probability is derived according to Equation 1 and Equation 2 described above.

Then the Maximisation step simply estimates another set of haplotype frequencies
given the genotype frequencies. This approach is also known as the gene-counting method
(Smith, Ann. Hum. Genet., 21:254-276, 1957).

$p_t^{(s+1)} = \frac{1}{2} \sum_{j=1}^{F} \sum_{i=1}^{c_j} \delta_{it} \cdot \mathrm{pr}(\mathrm{genotype}_i)^{(s)}$    Equation 4

where $\delta_{it}$ is an indicator variable which counts the number of times haplotype t occurs in
genotype i. It takes the values of 0, 1 or 2.

To ensure that the estimation finally obtained is the maximum-likelihood estimation,
several sets of starting values are required. The estimations obtained are compared and, if they
are different, the estimations leading to the best likelihood are kept.
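
The Expectation and Maximisation steps of Equations 1 to 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the EM-HAPLO or Arlequin implementation: each unphased multi-locus phenotype is coded as a tuple giving, per marker, the number of copies (0, 1 or 2) of one arbitrarily chosen allele; haplotypes are tuples of 0/1; and a single uniform starting point is used rather than several starting values. All function names are hypothetical.

```python
from collections import Counter
from itertools import product


def compatible_pairs(phenotype):
    """Unordered haplotype pairs (h_k, h_l) consistent with an unphased phenotype."""
    het = [i for i, g in enumerate(phenotype) if g == 1]
    base = [g // 2 for g in phenotype]  # homozygous loci: 0 -> 0, 2 -> 1
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for locus, bit in zip(het, bits):
            h1[locus], h2[locus] = bit, 1 - bit
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(phenotypes, tol=1e-7, max_iter=1000):
    """Estimate haplotype frequencies from unphased genotypes (tuples of 0/1/2)."""
    n = len(phenotypes)                                # N individuals
    counts = Counter(phenotypes)                       # n_j per distinct phenotype
    haplotypes = list(product((0, 1), repeat=len(phenotypes[0])))  # H = 2**K
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}          # starting values
    pairs = {ph: compatible_pairs(ph) for ph in counts}
    for _ in range(max_iter):
        new = dict.fromkeys(haplotypes, 0.0)
        for ph, n_j in counts.items():
            # Equation 2: Hardy-Weinberg probability of each compatible genotype.
            probs = [freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]
                     for a, b in pairs[ph]]
            p_j = sum(probs)                           # Equation 1
            if p_j == 0.0:
                continue
            for (a, b), pr in zip(pairs[ph], probs):
                g = (n_j / n) * pr / p_j               # Equation 3 (Expectation)
                new[a] += 0.5 * g                      # Equation 4 (gene counting);
                new[b] += 0.5 * g                      # a homozygote adds g to one haplotype
        change = max(abs(new[h] - freq[h]) for h in haplotypes)
        freq = new
        if change < tol:                               # stop criterion
            break
    return freq


# Two markers: four double homozygotes of each type plus two double heterozygotes.
sample = [(2, 2)] * 4 + [(0, 0)] * 4 + [(1, 1)] * 2
estimates = em_haplotype_frequencies(sample)
```

In this small sample the double heterozygotes are the only ambiguous phenotype, and the iterations assign them almost entirely to the two haplotypes already seen in the homozygotes. To follow the recommendation of several starting values, the uniform initial `freq` could be replaced by different random vectors and the run with the best likelihood kept.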



WO 99/54500 



56 



PCT/IB99/00822 



3) Methods to calculate linkage disequilibrium between markers 

A number of methods can be used to calculate linkage disequilibrium between any
two genetic positions. In practice, linkage disequilibrium is measured by applying a statistical
association test to haplotype data taken from a population.
Linkage disequilibrium between any pair of biallelic markers comprising at least one
of the biallelic markers of the present invention ($M_i$, $M_j$) having alleles ($a_i/b_i$) at marker $M_i$ and
alleles ($a_j/b_j$) at marker $M_j$ can be calculated for every allele combination ($a_i,a_j$; $a_i,b_j$; $b_i,a_j$ and
$b_i,b_j$), according to the Piazza formula:
$\Delta_{a_i a_j} = \sqrt{\theta_4} - \sqrt{(\theta_4 + \theta_3)(\theta_4 + \theta_2)}$, where:
$\theta_4$ = - - = frequency of genotypes not having allele $a_i$ at $M_i$ and not having allele $a_j$ at $M_j$
$\theta_3$ = - + = frequency of genotypes not having allele $a_i$ at $M_i$ and having allele $a_j$ at $M_j$
$\theta_2$ = + - = frequency of genotypes having allele $a_i$ at $M_i$ and not having allele $a_j$ at $M_j$
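
As a small worked illustration of the Piazza formula, the three genotype frequencies can be counted directly from two-marker genotypes. The coding is an assumption made for the sketch: each individual is a pair (g_i, g_j) giving the number of copies (0, 1 or 2) of allele a_i at M_i and of allele a_j at M_j, so "not having the allele" means a count of zero. The function name is hypothetical.

```python
import math


def piazza_delta(genotypes):
    """Piazza disequilibrium estimate from (g_i, g_j) allele-count pairs."""
    n = len(genotypes)
    # theta4: allele a_i absent and allele a_j absent.
    theta4 = sum(1 for gi, gj in genotypes if gi == 0 and gj == 0) / n
    # theta3: allele a_i absent, allele a_j present.
    theta3 = sum(1 for gi, gj in genotypes if gi == 0 and gj > 0) / n
    # theta2: allele a_i present, allele a_j absent.
    theta2 = sum(1 for gi, gj in genotypes if gi > 0 and gj == 0) / n
    return math.sqrt(theta4) - math.sqrt((theta4 + theta3) * (theta4 + theta2))


# Alleles a_i and a_j always occur together in this toy sample.
delta = piazza_delta([(0, 0)] * 5 + [(2, 2)] * 5)
```

When the two markers are independent, the joint absence frequency equals the product of the two marginal absence frequencies and the estimate is zero.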

Linkage disequilibrium (LD) between pairs of biallelic markers ($M_i$, $M_j$) can also be
calculated for every allele combination ($a_i,a_j$; $a_i,b_j$; $b_i,a_j$ and $b_i,b_j$), according to the
maximum-likelihood estimate (MLE) for delta (the composite genotypic disequilibrium coefficient), as
described by Weir (Weir B.S., Genetic Data Analysis, Sinauer Ass. Eds, 1996). The MLE for
the composite linkage disequilibrium is:
$D_{a_i a_j} = (2 n_1 + n_2 + n_3 + n_4/2)/N - 2\,(\mathrm{pr}(a_i) \cdot \mathrm{pr}(a_j))$

where $n_1 = \Sigma$ phenotype ($a_i/a_i$, $a_j/a_j$), $n_2 = \Sigma$ phenotype ($a_i/a_i$, $a_j/b_j$), $n_3 = \Sigma$ phenotype ($a_i/b_i$,
$a_j/a_j$), $n_4 = \Sigma$ phenotype ($a_i/b_i$, $a_j/b_j$) and N is the number of individuals in the sample.

This formula allows linkage disequilibrium between alleles to be estimated when only 
genotype, and not haplotype, data are available. 
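
The composite estimate can be computed from unphased genotypes along the following lines. As an assumption for the sketch, each individual is coded as a pair (g_i, g_j) of allele counts, where 2 means homozygous for a_i (or a_j), 1 heterozygous and 0 homozygous for the other allele; allele frequencies are taken from the same sample. The function name is hypothetical.

```python
def composite_ld(genotypes):
    """Composite disequilibrium estimate from (g_i, g_j) allele-count pairs."""
    n = len(genotypes)
    n1 = sum(1 for g in genotypes if g == (2, 2))  # a_i/a_i, a_j/a_j
    n2 = sum(1 for g in genotypes if g == (2, 1))  # a_i/a_i, a_j/b_j
    n3 = sum(1 for g in genotypes if g == (1, 2))  # a_i/b_i, a_j/a_j
    n4 = sum(1 for g in genotypes if g == (1, 1))  # a_i/b_i, a_j/b_j
    p_ai = sum(gi for gi, _ in genotypes) / (2 * n)  # frequency of allele a_i
    p_aj = sum(gj for _, gj in genotypes) / (2 * n)  # frequency of allele a_j
    return (2 * n1 + n2 + n3 + n4 / 2) / n - 2 * p_ai * p_aj


# Sixteen individuals in exact Hardy-Weinberg and linkage equilibrium (p = 0.5).
equilibrium = ([(2, 2)] + [(2, 1)] * 2 + [(2, 0)] + [(1, 2)] * 2 + [(1, 1)] * 4
               + [(1, 0)] * 2 + [(0, 2)] + [(0, 1)] * 2 + [(0, 0)])
d_equilibrium = composite_ld(equilibrium)
d_associated = composite_ld([(2, 2)] * 5 + [(0, 0)] * 5)
```

No phase information is needed: only the counts of the four genotype classes carrying a_i and a_j together enter the estimate.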

Another means of calculating the linkage disequilibrium between markers is as
follows. For a couple of biallelic markers, $M_i$ ($a_i/b_i$) and $M_j$ ($a_j/b_j$), fitting the Hardy-Weinberg
equilibrium, one can estimate the four possible haplotype frequencies in a given population
according to the approach described above.

The estimation of gametic disequilibrium between $a_i$ and $a_j$ is simply:

$D_{a_i a_j} = \mathrm{pr}(\mathrm{haplotype}(a_i, a_j)) - \mathrm{pr}(a_i) \cdot \mathrm{pr}(a_j)$

where $\mathrm{pr}(a_i)$ is the probability of allele $a_i$, $\mathrm{pr}(a_j)$ is the probability of allele $a_j$, and
$\mathrm{pr}(\mathrm{haplotype}(a_i, a_j))$ is estimated as in Equation 3 above.

For a couple of biallelic markers only one measure of disequilibrium is necessary to describe
the association between $M_i$ and $M_j$.

Then a normalised value of the above is calculated as follows:

$D'_{a_i a_j} = D_{a_i a_j} / \max(-\mathrm{pr}(a_i) \cdot \mathrm{pr}(a_j),\ -\mathrm{pr}(b_i) \cdot \mathrm{pr}(b_j))$ with $D_{a_i a_j} < 0$
$D'_{a_i a_j} = D_{a_i a_j} / \min(\mathrm{pr}(b_i) \cdot \mathrm{pr}(a_j),\ \mathrm{pr}(a_i) \cdot \mathrm{pr}(b_j))$ with $D_{a_i a_j} > 0$

The skilled person will readily appreciate that other LD calculation methods can be
used without undue experimentation.
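
The gametic disequilibrium and its normalised value can be sketched as below, taking an already estimated haplotype frequency as input. One assumption should be noted: for positive D the sketch divides by the smaller of the two products pr(b_i)·pr(a_j) and pr(a_i)·pr(b_j), which is the usual Lewontin bound and keeps the normalised value at or below 1. The function names are hypothetical.

```python
def gametic_d(p_haplotype, p_ai, p_aj):
    """D = pr(haplotype(a_i, a_j)) - pr(a_i) * pr(a_j)."""
    return p_haplotype - p_ai * p_aj


def normalised_d(d, p_ai, p_aj):
    """Normalise D by its extreme attainable value given the allele frequencies."""
    p_bi, p_bj = 1.0 - p_ai, 1.0 - p_aj
    if d < 0:
        bound = max(-p_ai * p_aj, -p_bi * p_bj)   # most negative attainable D
    else:
        bound = min(p_bi * p_aj, p_ai * p_bj)     # most positive attainable D
    return d / bound if bound != 0 else 0.0


d = gametic_d(p_haplotype=0.3, p_ai=0.4, p_aj=0.5)
d_prime = normalised_d(d, p_ai=0.4, p_aj=0.5)
```

With allele frequencies of 0.4 and 0.5, a haplotype frequency of 0.3 gives D = 0.1 out of a maximum attainable 0.2, hence a normalised value of 0.5.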

Linkage disequilibrium among a set of biallelic markers having an adequate
heterozygosity rate can be determined by genotyping between 50 and 1000 unrelated
individuals, preferably between 75 and 200, more preferably around 100.
4) Testing for association 

The statistical significance of a correlation between a
phenotype and a genotype, in this case an allele at a biallelic marker or a haplotype made up of
such alleles, may be determined by any statistical test known in the art and with any accepted
threshold of statistical significance being required. The application of particular methods and
thresholds of significance is well within the skill of the ordinary practitioner of the art.

Testing for association is performed by determining the frequency of a biallelic
marker allele in case and control populations and comparing these frequencies with a
statistical test to determine if there is a statistically significant difference in frequency which
would indicate a correlation between the trait and the biallelic marker allele under study.
Similarly, a haplotype analysis is performed by estimating the frequencies of all possible
haplotypes for a given set of biallelic markers in case and control populations, and comparing
these frequencies with a statistical test to determine if there is a statistically significant
correlation between the haplotype and the phenotype (trait) under study. Any statistical tool
useful to test for a statistically significant association between a genotype and a phenotype
may be used. Preferably the statistical test employed is a chi-square test with one degree of
freedom. A p-value is calculated (the p-value is the probability that a statistic as large or
larger than the observed one would occur by chance).
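
A chi-square test with one degree of freedom on allele counts can be sketched as follows. The layout is an assumption for the example: a 2 x 2 table of chromosome counts (tested allele versus other allele, in cases versus controls), with no continuity correction. The function name and the counts are hypothetical.

```python
import math


def allele_chi_square(case_a, case_b, control_a, control_b):
    """Chi-square (1 df) and p-value for a 2 x 2 table of allele counts."""
    n = case_a + case_b + control_a + control_b
    row_case, row_control = case_a + case_b, control_a + control_b
    col_a, col_b = case_a + control_a, case_b + control_b
    if 0 in (row_case, row_control, col_a, col_b):
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = (n * (case_a * control_b - case_b * control_a) ** 2
            / (row_case * row_control * col_a * col_b))
    # Probability of a statistic at least this large under no association (1 df).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# 50 cases and 50 controls, i.e. 100 chromosomes in each group.
chi2, p_value = allele_chi_square(case_a=60, case_b=40, control_a=40, control_b=60)
```

The same function applies to a haplotype by counting chromosomes that carry the haplotype against those that do not.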
Statistical significance 

In preferred embodiments, for an association to be considered significant for diagnostic
purposes, either as a positive basis for further diagnostic tests or as a preliminary starting point
for early preventive therapy, the p-value related to a biallelic marker association is preferably
about 1 x 10^-2 or less, more preferably about 1 x 10^-4 or less, for a single biallelic marker
analysis, and about 1 x 10^-3 or less, still more preferably 1 x 10^-6 or less, and most preferably
about 1 x 10^-8 or less, for a haplotype analysis involving several markers. These values are
believed to be applicable to any association studies involving single or multiple marker combinations.

The skilled person can use the range of values set forth above as a starting point in
order to carry out association studies with biallelic markers of the present invention. In doing
so, significant associations between the biallelic markers of the present invention and diseases
can be revealed.

Phenotypic permutation

In order to confirm the statistical significance of the first stage haplotype analysis
described above, it might be suitable to perform further analyses in which genotyping data
from case-control individuals are pooled and randomised with respect to the trait phenotype.
Each individual's genotyping data is randomly allocated to one of two groups, which contain the same
number of individuals as the case-control populations used to compile the data obtained in the
first stage. A second stage haplotype analysis is preferably run on these artificial groups,
preferably for the markers included in the haplotype of the first stage analysis showing the
highest relative risk coefficient. This experiment is reiterated preferably between 100
and 10,000 times. The repeated iterations allow the determination of the percentage of
obtained haplotypes with a significant p-value level.
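
The permutation procedure can be illustrated with the following sketch. For brevity it is an assumption-laden stand-in: the statistic is a single-marker chi-square on allele counts rather than a full haplotype analysis, each individual is represented by the number of copies (0, 1 or 2) of the tested allele, and the function names, iteration count and random seed are hypothetical.

```python
import random


def chi2_stat(case_counts, control_counts):
    """1-df chi-square on allele counts; inputs are per-individual counts 0/1/2."""
    a = sum(case_counts)
    b = 2 * len(case_counts) - a
    c = sum(control_counts)
    d = 2 * len(control_counts) - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return (a + b + c + d) * (a * d - b * c) ** 2 / denom if denom else 0.0


def permutation_p_value(case_counts, control_counts, n_iter=1000, seed=0):
    """Share of random relabellings whose statistic reaches the observed one."""
    observed = chi2_stat(case_counts, control_counts)
    pooled = list(case_counts) + list(control_counts)
    n_cases = len(case_counts)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)  # randomise genotypes with respect to the phenotype
        # Artificial groups keep the original case and control sample sizes.
        if chi2_stat(pooled[:n_cases], pooled[n_cases:]) >= observed:
            hits += 1
    return hits / n_iter


p_none = permutation_p_value([1] * 20, [1] * 20, n_iter=200)
p_strong = permutation_p_value([2] * 30, [0] * 30, n_iter=200)
```

A small returned proportion indicates that the observed association is rarely matched when the phenotype labels carry no information.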
Assessment of statistical association 

To address the problem of false positives, similar analysis may be performed with the
same case-control populations in random genomic regions. Results in random regions and the
candidate region are compared as described in US Provisional Patent Application entitled
"Methods, software and apparati for identifying genomic regions harbouring a gene associated
with a detectable trait".
5) Evaluation of risk factors 

The association between a risk factor (in genetic epidemiology the risk factor is the
presence or the absence of a certain allele or haplotype at marker loci) and a disease is
measured by the odds ratio (OR) and by the relative risk (RR). If $P(R^+)$ is the probability of
developing the disease for individuals with the risk factor R and $P(R^-)$ is the probability for
individuals without the risk factor, then the relative risk is simply the ratio of the two
probabilities, that is:

$RR = P(R^+)/P(R^-)$

In case-control studies, direct measures of the relative risk cannot be obtained because
of the sampling design. However, the odds ratio allows a good approximation of the relative
risk for low-incidence diseases and can be calculated:

$OR = \dfrac{F^+/(1 - F^+)}{F^-/(1 - F^-)}$

$F^+$ is the frequency of the exposure to the risk factor in cases and $F^-$ is the frequency of the
exposure to the risk factor in controls. $F^+$ and $F^-$ are calculated using the allelic or haplotype
frequencies of the study and further depend on the underlying genetic model (dominant,
recessive, additive...).
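
A short numerical sketch of the two measures follows. The odds ratio is written in its standard form, the odds of exposure in cases divided by the odds of exposure in controls; this form, the function names and the example frequencies are assumptions made for illustration.

```python
def relative_risk(p_with_factor, p_without_factor):
    """RR: probability of disease with the risk factor over that without it."""
    return p_with_factor / p_without_factor


def odds_ratio(f_cases, f_controls):
    """OR from exposure frequencies in cases (F+) and in controls (F-)."""
    if not (0 < f_cases < 1 and 0 < f_controls < 1):
        raise ValueError("exposure frequencies must lie strictly between 0 and 1")
    return (f_cases / (1 - f_cases)) / (f_controls / (1 - f_controls))


# Hypothetical figures: half of the cases and a quarter of the controls
# carry the allele or haplotype under study.
example_or = odds_ratio(f_cases=0.5, f_controls=0.25)
example_rr = relative_risk(p_with_factor=0.02, p_without_factor=0.01)
```

How the exposure frequencies are derived from allele or haplotype frequencies depends on the genetic model assumed, which the sketch leaves to the caller.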

One can further estimate the attributable risk (AR), which describes the proportion of
individuals in a population exhibiting a trait due to a given risk factor. This measure is
important in quantitating the role of a specific factor in disease etiology and in terms of the
public health impact of a risk factor. The public health relevance of this measure lies in
estimating the proportion of cases of disease in the population that could be prevented if the
exposure of interest were absent. AR is determined as follows:

$AR = P_E (RR - 1) / (P_E (RR - 1) + 1)$

AR is the risk attributable to a biallelic marker allele or a biallelic marker haplotype. $P_E$ is the
frequency of exposure to an allele or a haplotype within the population at large; and RR is the
relative risk, which is approximated with the odds ratio when the trait under study has a
relatively low incidence in the general population.
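
The attributable risk formula lends itself to a one-line computation, sketched here with hypothetical inputs. The function name is an assumption, and the relative risk passed in may be an odds ratio used as its approximation for a low-incidence trait.

```python
def attributable_risk(exposure_frequency, relative_risk):
    """AR = P_E (RR - 1) / (P_E (RR - 1) + 1)."""
    excess = exposure_frequency * (relative_risk - 1.0)
    return excess / (excess + 1.0)


# Hypothetical example: 20% of the population carries the haplotype and
# carriers have three times the risk of non-carriers.
ar = attributable_risk(exposure_frequency=0.2, relative_risk=3.0)
```

A relative risk of 1 gives an attributable risk of zero, as expected when the marker carries no excess risk.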
IV.F. Identification Of Biallelic Markers In Linkage Disequilibrium With The Biallelic
Markers of the Invention

Once a first biallelic marker has been identified in a genomic region of interest, the
practitioner of ordinary skill in the art, using the teachings of the present invention, can easily
identify additional biallelic markers in linkage disequilibrium with this first marker. As
mentioned before, any marker in linkage disequilibrium with a first marker associated with a
trait will be associated with the trait. Therefore, once an association has been demonstrated
between a given biallelic marker and a trait, the discovery of additional biallelic markers
associated with this trait is of great interest in order to increase the density of biallelic markers
in this particular region. The causal gene or mutation will be found in the vicinity of the
marker or set of markers showing the highest correlation with the trait.

Identification of additional markers in linkage disequilibrium with a given marker
involves: (a) amplifying a genomic fragment comprising a first biallelic marker from a
plurality of individuals; (b) identifying second biallelic markers in the genomic region
harboring said first biallelic marker; (c) conducting a linkage disequilibrium analysis between
said first biallelic marker and second biallelic markers; and (d) selecting said second biallelic
markers as being in linkage disequilibrium with said first marker. Subcombinations
comprising steps (b) and (c) are also contemplated.

Methods to identify biallelic markers and to conduct linkage disequilibrium analysis
are described herein and can be carried out by the skilled person without undue
experimentation. The present invention then also concerns biallelic markers which are in
linkage disequilibrium with any of the specific biallelic markers of SEQ ID Nos. 1 to 3908, 1
to 2260, 2261 to 3374, and 3735 to 3908 and which are expected to present similar
characteristics in terms of their respective association with a given trait.

Example 5 illustrates the measurement of linkage disequilibrium between a publicly
known biallelic marker, the "ApoE Site A", located within the Alzheimer's related ApoE
gene, and other biallelic markers randomly derived from the genomic region containing the
ApoE gene.

IV.G. Identification Of Functional Mutations

Once a positive association is confirmed with a biallelic marker of the present
invention, the associated candidate gene can be scanned for mutations by comparing the
sequences of a selected number of trait positive and trait negative individuals. In a preferred
embodiment, functional regions such as exons and splice sites, promoters and other regulatory
regions of the candidate gene are scanned for mutations. Preferably, trait positive individuals
carry the haplotype shown to be associated with the trait and trait negative individuals do not
carry the haplotype or allele associated with the trait. The mutation detection procedure is
essentially similar to that used for biallelic site identification.

The method used to detect such mutations generally comprises the following steps: (a)
amplification of a region of the candidate gene comprising a biallelic marker or a group of
biallelic markers associated with the trait from DNA samples of trait positive patients and trait
negative controls; (b) sequencing of the amplified region; (c) comparison of DNA sequences
from trait-positive patients and trait-negative controls; and (d) determination of mutations
specific to trait-positive patients. Subcombinations which comprise steps (b) and (c) are
specifically contemplated.

It is preferred that candidate polymorphisms be then verified by screening a larger
population of cases and controls by means of any genotyping procedure such as those
described herein, preferably using a microsequencing technique in an individual test format.
Polymorphisms are considered as candidate mutations when present in cases and controls at
frequencies compatible with the expected association results.
V. Biallelic Markers Of The Invention In Methods Of Genetic Diagnostics

The biallelic markers of the present invention can also be used to develop diagnostic
tests capable of identifying individuals who express a detectable trait as the result of a specific
genotype or individuals whose genotype places them at risk of developing a detectable trait at
a subsequent time. The trait analyzed using the present diagnostics may be any detectable trait,
including a disease, a response to an agent acting on a disease, or side effects to an agent
acting on a disease.

The diagnostic techniques of the present invention may employ a variety of
methodologies to determine whether a test subject has a biallelic marker pattern associated
with an increased risk of developing a detectable trait or whether the individual suffers from a
detectable trait as a result of a particular mutation, including methods which enable the
analysis of individual chromosomes for haplotyping, such as family studies, single sperm
DNA analysis or somatic hybrids.




The present invention provides diagnostic methods to determine whether an individual
is at risk of developing a disease or suffers from a disease resulting from a mutation or a
polymorphism in a candidate gene of the present invention. The present invention also
provides methods to determine whether an individual is likely to respond positively to an
agent acting on a disease or whether an individual is at risk of developing an adverse side
effect to an agent acting on a disease.

These methods involve obtaining a nucleic acid sample from the individual and
determining whether the nucleic acid sample contains at least one allele or at least one
biallelic marker haplotype indicative of a risk of developing the trait or indicative that the
individual expresses the trait as a result of possessing a particular candidate gene
polymorphism or mutation (trait-causing allele).

Preferably, in such diagnostic methods, a nucleic acid sample is obtained from the individual
and this sample is genotyped using methods described above in III. The diagnostics may be
based on a single biallelic marker or on a group of biallelic markers.
In each of these methods, a nucleic acid sample is obtained from the test subject and
the biallelic marker pattern of one or more of the biallelic markers of SEQ ID Nos. 1 to 3908,
1 to 2260, 2261 to 3374, and 3735 to 3908 is determined.

In one embodiment, a PCR amplification is conducted on the nucleic acid sample to
amplify regions in which polymorphisms associated with a detectable phenotype have been
identified. The amplification products are sequenced to determine whether the individual
possesses one or more polymorphisms associated with a detectable phenotype. The primers
used to generate amplification products may comprise the primers of SEQ ID Nos. 3935 to
7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to
11599, and 11600 to 11773. Alternatively, the nucleic acid sample is subjected to
microsequencing reactions as described above to determine whether the individual possesses
one or more polymorphisms associated with a detectable phenotype resulting from a mutation
or a polymorphism in a candidate gene. In another embodiment, the nucleic acid sample is
contacted with one or more allele specific oligonucleotide probes which specifically
hybridize to one or more candidate gene alleles associated with a detectable phenotype.
These diagnostic methods are extremely valuable as they can, in certain
circumstances, be used to initiate preventive treatments or to allow an individual carrying a
significant haplotype to foresee warning signs such as minor symptoms. In diseases in which
attacks may be extremely violent and sometimes fatal if not treated on time, such as disease,
the knowledge of a potential predisposition, even if this predisposition is not absolute, might
contribute in a very significant manner to treatment efficacy. Similarly, a diagnosed
predisposition to a potential side effect could immediately direct the physician toward a
treatment for which such side effects have not been observed during clinical trials.

Diagnostics which analyze and predict response to a drug or side effects to a drug
may be used to determine whether an individual should be treated with a particular drug. For
example, if the diagnostic indicates a likelihood that an individual will respond positively to
treatment with a particular drug, the drug may be administered to the individual. Conversely,
if the diagnostic indicates that an individual is likely to respond negatively to treatment with a
particular drug, an alternative course of treatment may be prescribed. A negative response
may be defined as either the absence of an efficacious response or the presence of toxic side
effects.

Clinical drug trials represent another application for the markers of the present
invention. One or more markers indicative of response to an agent acting on a disease or to
side effects to an agent acting on a disease may be identified using the methods described
above. Thereafter, potential participants in clinical trials of such an agent may be screened to
identify those individuals most likely to respond favorably to the drug and exclude those likely
to experience side effects. In that way, the effectiveness of drug treatment may be measured
in individuals who respond positively to the drug, without lowering the measurement as a
result of the inclusion of individuals who are unlikely to respond positively in the study and
without risking undesirable safety problems.
VI. Computer-Related Embodiments

In some embodiments of the present invention, a computer-based system may
support the on-line coordination between the identification of biallelic markers and the
corresponding analysis of their frequency in the different groups.

As used herein the term "nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260,
2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to
11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773" encompasses the nucleotide
sequences of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842,
3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599,
and 11600 to 11773, fragments of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to
3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125,
10126 to 11599, and 11600 to 11773, nucleotide sequences homologous to SEQ ID NOs. 1 to
3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669
to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or homologous to
fragments of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842,
3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599,
and 11600 to 11773, and sequences complementary to all of the preceding sequences. As used
herein the term "nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735
to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to
10125, 10126 to 11599, and 11600 to 11773" further encompasses the nucleotide sequences
comprising, consisting essentially of, or consisting of any one of the following:
a) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the
complements thereof;

b) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the
complements thereof, further comprising the 1st allele of the polymorphic base of the
respective SEQ ID number;

c) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the
complements thereof, further comprising the 2nd allele of the polymorphic base of the
respective SEQ ID number;

d) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the
complements thereof;

e) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the
complements thereof, further comprising the 1st allele of the polymorphic base of the
respective SEQ ID number;

f) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the
complements thereof, further comprising the 2nd allele of the polymorphic base of the
respective SEQ ID number;

g) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the
complements thereof;

h) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the
complements thereof, further comprising the 1st allele of the polymorphic base of the
respective SEQ ID number;

i) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the
complements thereof, further comprising the 2nd allele of the polymorphic base of the
respective SEQ ID number; and

j) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, or 21 nucleotides, to the
extent that a contiguous span of these lengths is consistent with the lengths of the particular
Sequence ID, of any of SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842,
7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or the complements
thereof.
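
The contiguous-span language in items a) to j) can be illustrated with a minimal Python sketch. It is offered only as an illustration and not as the method of the invention: the function name, the zero-based `poly_index` argument, and the example sequence are hypothetical and do not correspond to any actual SEQ ID.

```python
def spans_with_polymorphic_base(sequence, poly_index, length):
    """Return every contiguous span of `length` bases that covers `poly_index`.

    `poly_index` is the zero-based position of the polymorphic base.
    An empty list is returned when the requested length is not
    consistent with the length of the sequence.
    """
    if length > len(sequence) or not 0 <= poly_index < len(sequence):
        return []
    # Earliest and latest start positions whose window still covers the base.
    first_start = max(0, poly_index - length + 1)
    last_start = min(poly_index, len(sequence) - length)
    return [sequence[s:s + length] for s in range(first_start, last_start + 1)]


def with_allele(sequence, poly_index, allele):
    """Return a copy of `sequence` carrying `allele` at the polymorphic base."""
    return sequence[:poly_index] + allele + sequence[poly_index + 1:]
```

For a hypothetical 10-base sequence with the polymorphic base at index 4, requesting spans of 8 bases yields three windows, while requesting 12 bases yields none because that length exceeds the sequence.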

The "nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to
3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125,
10126 to 11599, and 11600 to 11773" further encompass nucleotide sequences homologous to:

a) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the
complements thereof;

b) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the
complements thereof, further comprising the 1st allele of the polymorphic base of the
respective SEQ ID number;

c) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 1 to 2260 or the
complements thereof, further comprising the 2nd allele of the polymorphic base of the
respective SEQ ID number;

d) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the
complements thereof;

e) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the
complements thereof, further comprising the 1st allele of the polymorphic base of the
respective SEQ ID number;

f) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 2261 to 3734 or the
complements thereof, further comprising the 2nd allele of the polymorphic base of the
respective SEQ ID number;

g) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the
complements thereof;

h) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the
complements thereof, further comprising the 1st allele of the polymorphic base of the
respective SEQ ID number;

i) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, 22, 23, 24, 25, 30, 35, 43, 44,
45, 46 or 47 nucleotides, to the extent that a contiguous span of these lengths is consistent
with the lengths of the particular Sequence ID, of any of SEQ ID Nos. 3735 to 3908 or the
complements thereof, further comprising the 2nd allele of the polymorphic base of the
respective SEQ ID number; and

j) a contiguous span of at least 8, 10, 12, 15, 18, 19, 20, or 21 nucleotides, to the
extent that a contiguous span of these lengths is consistent with the lengths of the particular
Sequence ID, of any of SEQ ID Nos. 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842,
7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 or the complements
thereof.

Homologous sequences refer to a sequence having at least 99%, 98%, 97%, 96%, 95%,
90%, 85%, 80%, or 75% homology to these contiguous spans. Homology may be determined
using any method described herein, including BLAST2N with the default parameters or with any
modified parameters. Homologous sequences also may include RNA sequences in which
uridines replace the thymines in the nucleic acid codes of the invention. It will be appreciated
that the nucleic acid codes of the invention can be represented in the traditional single character
format (See the inside back cover of Stryer, Lubert. Biochemistry, 3rd edition. W. H. Freeman &
Co., New York.) or in any other format or code which records the identity of the nucleotides in a
sequence.
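
As an illustration of the single character format, of complementary sequences, and of RNA sequences in which uridines replace thymines, the following Python sketch may be helpful. The function names are hypothetical, and only the unambiguous characters A, C, G and T are handled; ambiguity codes are left unchanged by assumption.

```python
# Translation table for the four unambiguous DNA characters (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the complementary strand, written 5' to 3'."""
    return dna.translate(_COMPLEMENT)[::-1]


def to_rna(dna):
    """Return the RNA form of `dna`, with U in place of every T."""
    return dna.replace("T", "U").replace("t", "u")
```

Applied to the hypothetical string "ATGC", `reverse_complement` gives "GCAT" and `to_rna` gives "AUGC".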

It should be noted that the nucleic acid codes of the invention further encompass all of
the polynucleotides disclosed, described or claimed in the present application. Moreover, the
present invention specifically contemplates computer readable media and computer systems
wherein such codes are stored individually or in any combination.

It will be appreciated by those skilled in the art that the nucleic acid codes of SEQ ID
Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to
7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 can
be stored, recorded, and manipulated on any medium which can be read and accessed by a
computer. As used herein, the words "recorded" and "stored" refer to a process for storing
information on a computer medium. A skilled artisan can readily adopt any of the presently
known methods for recording information on a computer readable medium to generate
embodiments comprising one or more of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to
2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842,
7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. A particularly preferred
embodiment of the present invention is a computer readable medium having recorded thereon at
least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 nucleic acid codes of SEQ
ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195
to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.
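
One possible way of recording several nucleic acid codes on a computer readable medium is sketched below in Python. The FASTA-style text layout, the function names, and the record names are assumptions chosen for illustration; the specification does not prescribe any particular file format.

```python
import os
import tempfile


def write_fasta(records, path, width=60):
    """Record each named sequence in `records` as FASTA-style text at `path`."""
    with open(path, "w", encoding="ascii") as handle:
        for name, seq in records.items():
            handle.write(f">{name}\n")
            # Wrap long sequences to `width` characters per line.
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")


def read_fasta(path):
    """Read the file written by `write_fasta` back into a dictionary."""
    records = {}
    name = None
    with open(path, encoding="ascii") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:]
                records[name] = ""
            elif name is not None:
                records[name] += line
    return records


def round_trip(records):
    """Store `records` in a temporary file and retrieve them again."""
    fd, path = tempfile.mkstemp(suffix=".fasta")
    os.close(fd)
    try:
        write_fasta(records, path)
        return read_fasta(path)
    finally:
        os.remove(path)
```

In this sketch, `round_trip` returns a dictionary equal to the one it was given, which is the sense in which the codes are stored and retrievable.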

Computer readable media include magnetically readable media, optically readable
media, electronically readable media and magnetic/optical media. For example, the computer
readable media may be a hard disk, a floppy disk, a magnetic tape, CD-ROM, Digital Versatile
Disk (DVD), Random Access Memory (RAM), or Read Only Memory (ROM) as well as other
types of media known to those skilled in the art.

Embodiments of the present invention include systems, particularly computer systems
which store and manipulate the sequence information described herein. One example of a
computer system 100 is illustrated in block diagram form in Figure 14. As used herein, "a
computer system" refers to the hardware components, software components, and data storage
components used to analyze the nucleotide sequences of the nucleic acid codes of SEQ ID NOs.
1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668,
7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. In one
embodiment, the computer system 100 is a Sun Enterprise 1000 server (Sun Microsystems, Palo
Alto, CA). The computer system 100 preferably includes a processor for processing, accessing
and manipulating the sequence data. The processor 105 can be any well-known type of central
processing unit, such as the Pentium III from Intel Corporation, or similar processor from Sun,
Motorola, Compaq or International Business Machines.

Preferably, the computer system 100 is a general purpose system that comprises the
processor 105 and one or more internal data storage components 110 for storing data, and one or
more data retrieving devices for retrieving the data stored on the data storage components. A
skilled artisan can readily appreciate that any one of the currently available computer systems is
suitable.

In one particular embodiment, the computer system 100 includes a processor 105
connected to a bus which is connected to a main memory 115 (preferably implemented as RAM)
and one or more internal data storage devices 110, such as a hard drive and/or other computer
readable media having data recorded thereon. In some embodiments, the computer system 100
further includes one or more data retrieving devices 118 for reading the data stored on the internal
data storage devices 110.

The data retrieving device 118 may represent, for example, a floppy disk drive, a
compact disk drive, a magnetic tape drive, etc. In some embodiments, the internal data storage
device 110 is a removable computer readable medium such as a floppy disk, a compact disk, a
magnetic tape, etc. containing control logic and/or data recorded thereon. The computer system
100 may advantageously include or be programmed by appropriate software for reading the
control logic and/or the data from the data storage component once inserted in the data retrieving
device.

The computer system 100 includes a display 120 which is used to display output to a 
computer user. It should also be noted that the computer system 100 can be linked to other 
computer systems 125a-c in a network or wide area network to provide centralized access to the 
computer system 100. 

Software for accessing and processing the nucleotide sequences of the nucleic acid codes of SEQ
ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195
to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773
(such as search tools, compare tools, and modeling tools etc.) may reside in main memory 115
during execution.

In some embodiments, the computer system 100 may further comprise a sequence
comparer for comparing the above-described nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to
2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842,
7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 stored on a computer
readable medium to reference nucleotide or polypeptide sequences stored on a computer
readable medium. A "sequence comparer" refers to one or more programs which are
implemented on the computer system 100 to compare a nucleotide sequence with other
nucleotide sequences and/or compounds stored within the data storage means. For example, the
sequence comparer may compare the nucleotide sequences of the nucleic acid codes of SEQ ID
Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to
7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773
stored on a computer readable medium to reference sequences stored on a computer readable
medium to identify homologies or structural motifs. The various sequence comparer programs
identified elsewhere in this patent specification are particularly contemplated for use in this
aspect of the invention.

Figure 15 is a flow diagram illustrating one embodiment of a process 200 for comparing
a new nucleotide or protein sequence with a database of sequences in order to determine the
homology levels between the new sequence and the sequences in the database. The database of
sequences can be a private database stored within the computer system 100, or a public database
such as GENBANK that is available through the Internet.

The process 200 begins at a start state 201 and then moves to a state 202 wherein the
new sequence to be compared is stored to a memory in a computer system 100. As discussed
above, the memory could be any type of memory, including RAM or an internal storage device.

The process 200 then moves to a state 204 wherein a database of sequences is opened for
analysis and comparison. The process 200 then moves to a state 206 wherein the first sequence
stored in the database is read into a memory on the computer. A comparison is then performed at
a state 210 to determine if the first sequence is the same as the second sequence. It is important
to note that this step is not limited to performing an exact comparison between the new sequence
and the first sequence in the database. Methods are well known to those of skill in the art
for comparing two nucleotide or protein sequences, even if they are not identical. For example,
gaps can be introduced into one sequence in order to raise the homology level between the two
tested sequences. The parameters that control whether gaps or other features are introduced into
a sequence during comparison are normally entered by the user of the computer system.

Once a comparison of the two sequences has been performed at the state 210, a
determination is made at a decision state 212 whether the two sequences are the same. Of
course, the term "same" is not limited to sequences that are absolutely identical. Sequences that
are within the homology parameters entered by the user will be marked as "same" in the process
200.

If a determination is made that the two sequences are the same, the process 200 moves to
a state 214 wherein the name of the sequence from the database is displayed to the user. This
state notifies the user that the sequence with the displayed name fulfills the homology constraints
that were entered. Once the name of the stored sequence is displayed to the user, the process 200
moves to a decision state 218 wherein a determination is made whether more sequences exist in
the database. If no more sequences exist in the database, then the process 200 terminates at an
end state 220. However, if more sequences do exist in the database, then the process 200 moves
to a state 224 wherein a pointer is moved to the next sequence in the database so that it can be
compared to the new sequence. In this manner, the new sequence is aligned and compared with
every sequence in the database.

It should be noted that if a determination had been made at the decision state 212 that the
sequences were not homologous, then the process 200 would move immediately to the decision
state 218 in order to determine if any other sequences were available in the database for
comparison.
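
The loop of process 200 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the implementation of Figure 15: the database is modeled as a dictionary of named sequences, the comparison is a simple ungapped position-by-position identity, and the names `percent_identity`, `scan_database` and `min_identity` are hypothetical. A practical sequence comparer would normally use a gapped alignment program instead.

```python
def percent_identity(a, b):
    """Ungapped identity between two sequences, as a percentage.

    Positions are compared in order; the longer sequence sets the
    denominator so that a length difference lowers the score.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / longest


def scan_database(new_sequence, database, min_identity=90.0):
    """Return the names of stored sequences that count as the "same".

    `database` maps sequence names to sequences; `min_identity` plays
    the role of the homology parameter entered by the user.
    """
    hits = []
    for name, stored in database.items():        # states 206 and 224
        score = percent_identity(new_sequence, stored)   # state 210
        if score >= min_identity:                 # decision state 212
            hits.append(name)                     # state 214
    return hits                                   # end state 220
```

With a hypothetical three-entry database, lowering `min_identity` widens the set of names reported, which mirrors the statement that "same" depends on the homology parameters entered by the user.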

Accordingly, one aspect of the present invention is a computer system comprising a
processor, a data storage device having stored thereon a nucleic acid code of SEQ ID Nos. 1 to
3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669
to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773, a data storage
device having retrievably stored thereon reference nucleotide sequences or polypeptide
sequences to be compared to the nucleic acid code of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261
to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to
11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 and a sequence comparer for
conducting the comparison. The sequence comparer may indicate a homology level between
the sequences compared or identify structural motifs in the above described nucleic acid code
of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to
6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600
to 11773 or it may identify structural motifs in sequences which are compared to these nucleic
acid codes and polypeptide codes. In some embodiments, the data storage device may have
stored thereon the sequences of at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000,
or 5000 of the nucleic acid codes of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to
3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125,
10126 to 11599, and 11600 to 11773.

Another aspect of the present invention is a method for determining the level of
homology between a nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374,
3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866
to 10125, 10126 to 11599, and 11600 to 11773 and a reference nucleotide sequence,
comprising the steps of reading the nucleic acid code and the reference nucleotide sequence
through the use of a computer program which determines homology levels and determining
homology between the nucleic acid code and the reference nucleotide sequence with the
computer program. The computer program may be any of a number of computer programs for
determining homology levels, including those specifically enumerated herein, including
BLAST2N with the default parameters or with any modified parameters. The method may be
implemented using the computer systems described above. The method may also be performed
by reading at least 2, 5, 10, 15, 20, 25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the above
described nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to
3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125,
10126 to 11599, and 11600 to 11773 through use of the computer program and determining
homology between the nucleic acid codes and reference nucleotide sequences.

Figure 16 is a flow diagram illustrating one embodiment of a process 250 in a
computer for determining whether two sequences are homologous. The process 250 begins at
a start state 252 and then moves to a state 254 wherein a first sequence to be compared is
stored to a memory. The second sequence to be compared is then stored to a memory at a
state 256. The process 250 then moves to a state 260 wherein the first character in the first
sequence is read and then to a state 262 wherein the first character of the second sequence is
read. It should be understood that if the sequence is a nucleotide sequence, then the character
would normally be either A, T, C, G or U. If the sequence is a protein sequence, then it
should be in the single letter amino acid code so that the first and second sequences can be
easily compared.

A determination is then made at a decision state 264 whether the two characters are
the same. If they are the same, then the process 250 moves to a state 268 wherein the next
characters in the first and second sequences are read. A determination is then made whether
the next characters are the same. If they are, then the process 250 continues this loop until
two characters are not the same. If a determination is made that the next two characters are
not the same, the process 250 moves to a decision state 274 to determine whether there are any
more characters in either sequence to read.

If there aren't any more characters to read, then the process 250 moves to a state 276
wherein the level of homology between the first and second sequences is displayed to the user.
The level of homology is determined by calculating the proportion of characters between the
sequences that were the same out of the total number of characters in the first sequence. Thus,
if every character in a first 100 nucleotide sequence aligned with every character in a second
sequence, the homology level would be 100%.
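
The homology level described for process 250 can be illustrated with a short Python sketch. It assumes, for illustration only, that the two sequences are compared position by position without gaps and that every position of the first sequence is examined; the function name `homology_level` is hypothetical and is not taken from Figure 16.

```python
def homology_level(first, second):
    """Percentage of characters in `first` matched at the same position in `second`.

    The denominator is the number of characters in the first sequence,
    so positions beyond the end of `second` count as mismatches.
    """
    if not first:
        raise ValueError("the first sequence must not be empty")
    same = 0
    for position, character in enumerate(first):
        # Read one character from each sequence and compare them.
        if position < len(second) and second[position] == character:
            same += 1
    return 100.0 * same / len(first)
```

Two identical sequences give 100.0, and a single mismatch in a four-character sequence gives 75.0, consistent with the proportion described above.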

Alternatively, the computer program may be a computer program which compares the
nucleotide sequences of the nucleic acid codes of the present invention to reference nucleotide
sequences in order to determine whether the nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to
2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842,
7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 differs from a reference
nucleic acid sequence at one or more positions. Optionally such a program records the length
and identity of inserted, deleted or substituted nucleotides with respect to the sequence of either
the reference polynucleotide or the nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260,
2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to
11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. In one embodiment, the computer
program may be a program which determines whether the nucleotide sequences of the nucleic
acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842,
3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599,
and 11600 to 11773 contain a biallelic marker or single nucleotide polymorphism (SNP) with
respect to a reference nucleotide sequence. This single nucleotide polymorphism may comprise a
single base substitution, insertion, or deletion, while this biallelic marker may comprise about
one to ten consecutive bases substituted, inserted or deleted.

Accordingly, another aspect of the present invention is a method for determining
whether a nucleic acid code of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to
3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125,
10126 to 11599, and 11600 to 11773 differs at one or more nucleotides from a reference
nucleotide sequence comprising the steps of reading the nucleic acid code and the reference
nucleotide sequence through use of a computer program which identifies differences between
nucleic acid sequences and identifying differences between the nucleic acid code and the
reference nucleotide sequence with the computer program. In some embodiments, the
computer program is a program which identifies single nucleotide polymorphisms. The
method may be implemented by the computer systems described above and the method
illustrated in Figure 16. The method may also be performed by reading at least 2, 5, 10, 15, 20,
25, 30, 50, 100, 200, 500, 1000, 2000, or 5000 of the nucleic acid codes of SEQ ID NOs. 1 to
3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669
to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 and the
reference nucleotide sequences through the use of the computer program and identifying
differences between the nucleic acid codes and the reference nucleotide sequences with the
computer program.
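
A program which identifies differences of this kind might, in its simplest form, look like the Python sketch below. It covers only single base substitutions between two sequences that are assumed to be already aligned and of equal length; insertions and deletions would require an alignment step that is not shown. The function name and the tuple layout are hypothetical.

```python
def find_substitutions(code, reference):
    """List single base substitutions between two pre-aligned sequences.

    Each entry is (position, reference_base, code_base), with positions
    counted from 1. Both inputs must have the same length, because this
    sketch does not handle insertions or deletions.
    """
    if len(code) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    differences = []
    for index, (code_base, ref_base) in enumerate(zip(code, reference)):
        if code_base != ref_base:
            differences.append((index + 1, ref_base, code_base))
    return differences
```

For the hypothetical pair "ACGTA" and "ACCTA", the sketch reports one substitution at position 3, where the reference carries C and the code carries G.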

In other embodiments the computer based system may further comprise an identifier for
identifying features within the nucleotide sequences of the nucleic acid codes of SEQ ID NOs.
1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668,
7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773.

An "identifier" refers to one or more programs which identify certain features within
the above-described nucleotide sequences of the nucleic acid codes of SEQ ID NOs. 1 to
3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669
to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773. In one
embodiment, the identifier may comprise a program which identifies an open reading frame in
the cDNA codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to
7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to
11599, and 11600 to 11773.

Figure 17 is a flow diagram illustrating one embodiment of an identifier process 300
for detecting the presence of a feature in a sequence. The process 300 begins at a start state
302 and then moves to a state 304 wherein a first sequence that is to be checked for features is
stored to a memory 115 in the computer system 100. The process 300 then moves to a state
306 wherein a database of sequence features is opened. Such a database would include a list
of each feature's attributes along with the name of the feature. For example, a feature name
could be "Initiation Codon" and the attribute would be "ATG". Another example would be
the feature name "TAATAA Box" and the feature attribute would be "TAATAA". An
example of such a database is produced by the University of Wisconsin Genetics Computer
Group (www.gcg.com).

Once the database of features is opened at the state 306, the process 300 moves to a
state 308 wherein the first feature is read from the database. A comparison of the attribute of
the first feature with the first sequence is then made at a state 310. A determination is then
made at a decision state 316 whether the attribute of the feature was found in the first
sequence. If the attribute was found, then the process 300 moves to a state 318 wherein the
name of the found feature is displayed to the user.

The process 300 then moves to a decision state 320 wherein a determination is made whether
more features exist in the database. If no more features exist, then the process 300
terminates at an end state 324. However, if more features do exist in the database, then the
process 300 reads the next sequence feature at a state 326 and loops back to the state 310
wherein the attribute of the next feature is compared against the first sequence.

It should be noted that if the feature attribute is not found in the first sequence at the
decision state 316, the process 300 moves directly to the decision state 320 in order to 
determine if any more features exist in the database. 
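
The identifier process 300 can be sketched in Python as a loop over a small feature table. The table below contains only the two examples named in the text; the dictionary layout, the function name, and the use of exact substring matching are assumptions made for illustration and do not describe the Genetics Computer Group database.

```python
# Hypothetical feature table: feature name -> feature attribute.
FEATURES = {
    "Initiation Codon": "ATG",
    "TAATAA Box": "TAATAA",
}


def identify_features(sequence, features):
    """Return the names of all features whose attribute occurs in `sequence`."""
    found = []
    for name, attribute in features.items():   # states 308 and 326
        if attribute in sequence:               # states 310 and 316
            found.append(name)                  # state 318
    return found                                # end state 324
```

A sequence containing both "ATG" and "TAATAA" is reported with both feature names, while a sequence containing neither yields an empty list.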

Accordingly, another aspect of the present invention is a method of identifying a
feature within the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374,
3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866
to 10125, 10126 to 11599, and 11600 to 11773 comprising reading the nucleic acid code(s)
through the use of a computer program which identifies features therein and identifying
features within the nucleic acid code(s) with the computer program. In one embodiment, the
computer program comprises a computer program which identifies open reading frames. The
method may be performed by reading a single sequence or at least 2, 5, 10, 15, 20, 25, 30, 50,
100, 200, 500, 1000, 2000, or 5000 of the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to
2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842,
7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 through the use of the
computer program and identifying features within the nucleic acid codes with the computer
program.
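
By way of illustration, a program which identifies open reading frames could be sketched in Python as follows. The sketch scans only the three forward reading frames, treats ATG as the sole start codon and TAA, TAG and TGA as stop codons, and uses a hypothetical `min_codons` threshold that counts the stop codon; none of these choices is specified in the present application.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=2):
    """Find open reading frames on the forward strand.

    Returns (start, end, orf) tuples, where `start` is the zero-based
    index of the A in ATG and `end` is one past the stop codon.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the frame one complete codon at a time.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return orfs
```

In the hypothetical string "CCATGAAATAGCC", the only reading frame found starts at index 2 and ends after the TAG stop codon.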

The nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to
3908, 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125,
10126 to 11599, and 11600 to 11773 may be stored and manipulated in a variety of data
processor programs in a variety of formats. For example, the nucleic acid codes of SEQ ID
NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194, 6195 to
7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 may
be stored as text in a word processing file, such as Microsoft WORD or WORDPERFECT or as
an ASCII file in a variety of database programs familiar to those of skill in the art, such as DB2,
SYBASE, or ORACLE. In addition, many computer programs and databases may be used as
sequence comparers, identifiers, or sources of reference nucleotide sequences to be compared to
the nucleic acid codes of SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935
to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to
11599, and 11600 to 11773. The following list is intended not to limit the invention but to
provide guidance to programs and databases which are useful with the nucleic acid codes of
SEQ ID NOs. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908, 3935 to 7842, 3935 to 6194,
6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to
11773.

The programs and databases which may be used include, but are not limited to:
MacPattern (EMBL), DiscoveryBase (Molecular Applications Group), GeneMine (Molecular
Applications Group), Look (Molecular Applications Group), MacLook (Molecular Applications
Group), BLAST and BLAST2 (NCBI), BLASTN and BLASTX (Altschul et al., J. Mol. Biol.
215: 403 (1990)), FASTA (Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85: 2444 (1988)),
FASTDB (Brutlag et al., Comp. App. Biosci. 6:237-245, 1990), Catalyst (Molecular Simulations
Inc.), Catalyst/SHAPE (Molecular Simulations Inc.), Cerius2.DBAccess (Molecular Simulations
Inc.), HypoGen (Molecular Simulations Inc.), Insight II (Molecular Simulations Inc.), Discover
(Molecular Simulations Inc.), CHARMm (Molecular Simulations Inc.), Felix (Molecular
Simulations Inc.), DelPhi (Molecular Simulations Inc.), QuanteMM (Molecular Simulations
Inc.), Homology (Molecular Simulations Inc.), Modeler (Molecular Simulations Inc.), ISIS
(Molecular Simulations Inc.), Quanta/Protein Design (Molecular Simulations Inc.), WebLab
(Molecular Simulations Inc.), WebLab Diversity Explorer (Molecular Simulations Inc.), Gene
Explorer (Molecular Simulations Inc.), SeqFold (Molecular Simulations Inc.), the MDL
Available Chemicals Directory database, the MDL Drug Data Report database, the
Comprehensive Medicinal Chemistry database, Derwent's World Drug Index database, the
BioByteMasterFile database, the Genbank database, and the Genseqn database. Many other
programs and databases would be apparent to one of skill in the art given the present disclosure.
Motifs which may be detected using the above programs include sequences encoding
leucine zippers, helix-turn-helix motifs, glycosylation sites, ubiquitination sites, alpha helices,
and beta sheets, signal sequences encoding signal peptides which direct the secretion of the
encoded proteins, sequences implicated in transcription regulation such as homeoboxes, acidic
stretches, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites.
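
As a hedged illustration of motif detection, the following sketch scans a translated (protein) sequence for the N-linked glycosylation sequon Asn-X-Ser/Thr, where X is any residue other than proline. It is a simplified pattern match rather than the method used by any of the programs listed above; the function name glycosylation_sites and the example sequence are hypothetical.

```python
import re

# Asn, then any residue except Pro, then Ser or Thr. The lookahead lets
# overlapping sequons (e.g. "NNTS") be reported individually.
SEQUON = re.compile(r"(?=N[^P][ST])")


def glycosylation_sites(protein):
    """Return 0-based positions of the Asn in each N-X-S/T sequon (X != Pro)."""
    return [m.start() for m in SEQUON.finditer(protein.upper())]


# Hypothetical toy peptide used only to exercise the pattern.
print(glycosylation_sites("MKNGSANPTLNNTS"))
```

A match of this kind indicates only a potential site; whether the site is actually glycosylated cannot be determined from the sequence alone.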
It should be noted that the nucleic acid codes of the invention further encompass all of
the polynucleotides disclosed, described or claimed in the present application. Moreover, the
present invention specifically contemplates the storage of such codes on computer readable
media and computer systems individually or in any combination, as well as the use of such codes
and combinations in the methods of VI.
VII. Mapping and Maps Comprising the Biallelic Markers of the Invention

The human haploid genome contains an estimated 80,000 to 100,000 or more genes
scattered on a 3 x 10^9 base-long double stranded DNA shared among the 24 chromosomes.
Each human being is diploid, i.e. possesses two haploid genomes, one from paternal origin,
the other from maternal origin. The sequence of the human genome varies among individuals
in a population. About 10^7 sites scattered along the 3 x 10^9 base pairs of DNA are polymorphic,
existing in at least two variant forms called alleles. Most of these polymorphic sites are
generated by single base substitution mutations and are biallelic. Less than 10^5 polymorphic
sites are due to more complex changes and are very often multi-allelic, i.e. exist in more than
two allelic forms. At a given polymorphic site, any individual (diploid) can be either
homozygous (twice the same allele) or heterozygous (two different alleles). A given
polymorphism or rare mutation can be either neutral (no effect on trait), or functional, i.e.
responsible for a particular genetic trait.
Genetic Maps 

The first step towards the identification of genes associated with a detectable trait,
such as a disease or any other detectable trait, consists in the localization of genomic regions
containing trait-causing genes using genetic mapping methods. The preferred traits
contemplated within the present invention relate to fields of therapeutic interest; in particular
embodiments, they will be disease traits and/or drug response traits, reflecting drug efficacy or
toxicity. Traits can either be "binary", e.g. diabetic vs. non diabetic, or "quantitative", e.g.
elevated blood pressure. Individuals affected by a quantitative trait can be classified
according to an appropriate scale of trait values, e.g. blood pressure ranges. Each trait value
range can then be analyzed as a binary trait. Patients showing a trait value within one such
range will be studied in comparison with patients showing a trait value outside of this range.
In such a case, genetic analysis methods will be applied to subpopulations of individuals
showing trait values within defined ranges.
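
The conversion of a quantitative trait into a binary trait for one trait value range may be sketched as follows. The sketch is illustrative only; the function name binarize_by_range, the individual identifiers and the blood pressure values and range are hypothetical and are not taken from the present disclosure.

```python
def binarize_by_range(trait_values, low, high):
    """Map each individual to True if low <= value < high, otherwise False."""
    return {person: low <= value < high for person, value in trait_values.items()}


# Hypothetical systolic blood pressure readings (mm Hg) for four individuals.
systolic = {"P1": 118, "P2": 142, "P3": 155, "P4": 131}

# Individuals inside the 140-160 range are compared with those outside it.
in_range = binarize_by_range(systolic, 140, 160)
print(in_range)
```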
Genetic mapping involves the analysis of the segregation of polymorphic loci in
trait-positive and trait-negative populations. Polymorphic loci constitute a small fraction of the
human genome (less than 1%), compared to the vast majority of human genomic DNA which
is identical in sequence among the chromosomes of different individuals. Among all existing
human polymorphic loci, genetic markers can be defined as genome-derived polynucleotides
which are sufficiently polymorphic to allow a reasonable probability that a randomly selected
person will be heterozygous, and thus informative for genetic analysis by methods such as
linkage analysis or association studies.
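
The probability that a randomly selected person is heterozygous at a biallelic marker can be approximated from the allele frequencies. The following sketch assumes Hardy-Weinberg proportions, an assumption that is not stated in the text above, under which the expected heterozygosity is 2p(1 - p) for an allele of frequency p. The function name is hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygote frequency 2p(1-p) for a biallelic marker.

    Assumes Hardy-Weinberg proportions; p is the frequency of one allele.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return 2.0 * p * (1.0 - p)


# The value peaks at 0.5 when both alleles are equally frequent.
for freq in (0.1, 0.3, 0.5):
    print(freq, expected_heterozygosity(freq))
```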

A genetic map consists of a collection of polymorphic markers which have been
positioned on the human chromosomes. Genetic maps may be combined with physical maps,
collections of ordered overlapping fragments of genomic DNA whose arrangement along the
human chromosomes is known. The optimal genetic map should possess the following
characteristics:

- the density of the genetic markers scattered along the genome should be sufficient to
allow the identification and localization of any trait-related polymorphism,

- each marker should have an adequate level of heterozygosity, so as to be informative
in a large percentage of different meioses,

- all markers should be easily typed on a routine basis, at a reasonable expense, and in
a reasonable amount of time,

- the entire set of markers per chromosome should be ordered in a highly reliable
fashion.

However, while the above maps are optimal, it will be appreciated that the maps of the
present invention may be used in the individual marker and haplotype association analyses
described below without the necessity of determining the order of biallelic markers derived
from a single BAC with respect to one another.

Construction of a Physical Map

The first step in constructing a high density genetic map of biallelic markers is the
construction of a physical map. Physical maps consist of ordered, overlapping cloned
fragments of genomic DNA covering a portion of the genome, preferably covering one or all
chromosomes. Obtaining a physical map of the genome entails constructing and ordering a
genomic DNA library. For an example of a complete explanation of the construction of a
physical map from a BAC library see related PCT Application No. PCT/IB98/00193 filed July
19, 1998. The methods disclosed therein can be used to generate larger, more complete sets of
markers and entire maps of the human genome comprising the map-related biallelic markers of
the invention.
Biallelic Markers

It will be appreciated that the ordered DNA fragments containing these groups of
biallelic markers need not completely cover the genomic regions of these lengths but may instead
be incomplete contigs having one or more gaps therein. As discussed in further detail below,
biallelic markers may be used in single marker and haplotype association analyses regardless of
the completeness of the corresponding physical contig harboring them.
Using the procedures above, 3908 biallelic markers, each having two alleles, were
identified using sequences obtained from BACs which had been localized on the genome. In
some cases, markers were identified using pooled BACs and thereafter reassigned to
individual BACs using STS screening procedures such as those described in Examples 1 and
2. The sequences of these biallelic markers are provided in the accompanying Sequence
Listing as SEQ ID Nos. 1 to 3908. Although the sequences of SEQ ID Nos. 1 to 3908 will be
used as exemplary markers throughout the present application, these markers are not limited to
markers having the exact flanking sequences surrounding the polymorphic bases which are
enumerated in SEQ ID Nos. 1 to 3908. Rather, it will be appreciated that the flanking
sequences surrounding the polymorphic bases of SEQ ID Nos. 1 to 3908 may be lengthened or
shortened to any extent compatible with their intended use and the present invention
specifically contemplates such sequences. The sequences of these biallelic markers may be
used to construct genomic maps as well as in the gene identification and diagnostic techniques
described herein. It will be appreciated that the biallelic markers referred to herein may be of
any length compatible with their intended use provided that the markers include the
polymorphic base, and the present invention specifically contemplates such sequences.

Some of the markers of SEQ ID Nos. 1 to 3908 as well as related amplification and
microsequencing primers were disclosed in the instant priority documents. However, some of
the earlier described amplification primers and microsequencing primers did not have the
precise sequence lengths disclosed in the instant application. It will be appreciated that either
length of primers may be used in the methods disclosed in the present application.

In addition, the internal identification numbers used to identify the biallelic markers
disclosed in U.S. Provisional Patent Application Serial No. 60/082,614 filed April 21, 1998
have been revised to include additional numbers on the end. For example, the marker
formerly given the internal identification number 99-1091 was given the revised internal
identification number 99-1091-446. Therefore, it will be appreciated that shortened
identification numbers and extended identification numbers which overlap one another refer to
the same markers.

Ordering of biallelic markers

Biallelic markers can be ordered to determine their positions along chromosomes,
preferably subchromosomal regions, by methods known in the art as well as those disclosed in
PCT Application No. PCT/IB98/00193 filed July 19, 1998, and U.S. Provisional Patent
Application Serial No. 60/082,614 filed April 21, 1998.

The positions of the biallelic markers along chromosomes may be determined using a
variety of methodologies. In one approach, radiation hybrid mapping is used. Radiation hybrid
(RH) mapping is a somatic cell genetic approach that can be used for high resolution mapping of
the human genome. In this approach, cell lines containing one or more human chromosomes are
lethally irradiated, breaking each chromosome into fragments whose size depends on the
radiation dose. These fragments are rescued by fusion with cultured rodent cells, yielding
subclones containing different portions of the human genome. This technique is described by
Benham et al. (Genomics 4:509-517, 1989) and Cox et al. (Science 250:245-250, 1990). The
random and independent nature of the subclones permits efficient mapping of any human
genome marker. Human DNA isolated from a panel of 80-100 cell lines provides a mapping
reagent for ordering biallelic markers. In this approach, the frequency of breakage between
markers is used to measure distance, allowing construction of fine resolution maps as has been
done for ESTs (Schuler et al., Science 274:540-546, 1996).
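
The use of breakage frequency as a distance measure may be illustrated by the following sketch. It is offered as an assumption rather than as the method of the present disclosure: it uses a two-point estimator of the breakage frequency theta of the kind described in the RH mapping literature (observed discordant retention divided by the discordance expected for unlinked markers) and converts theta to centiRays as -100 ln(1 - theta). The function names and the retention vectors are hypothetical.

```python
import math


def breakage_frequency(retained_a, retained_b):
    """Estimate theta from per-hybrid retention calls (True = marker retained)."""
    if len(retained_a) != len(retained_b) or not retained_a:
        raise ValueError("need two equal-length, non-empty retention vectors")
    total = len(retained_a)
    # Hybrids that retain exactly one of the two markers.
    discordant = sum(a != b for a, b in zip(retained_a, retained_b))
    r_a = sum(retained_a) / total
    r_b = sum(retained_b) / total
    # Discordant hybrids expected if the markers were retained independently.
    expected_if_unlinked = total * (r_a + r_b - 2 * r_a * r_b)
    if expected_if_unlinked == 0:
        raise ValueError("both markers retained in all or in none of the hybrids")
    return min(discordant / expected_if_unlinked, 1.0)


def centirays(theta):
    """Convert a breakage frequency to a map distance in centiRays."""
    if theta >= 1.0:
        return math.inf
    return -100.0 * math.log(1.0 - theta)


# Hypothetical retention patterns for two markers across eight hybrids.
marker_a = [True, True, True, True, False, False, False, False]
marker_b = [True, True, True, False, False, False, False, True]
theta = breakage_frequency(marker_a, marker_b)
print(theta, centirays(theta))
```

Real panels of 80-100 hybrids would normally be analyzed with multipoint likelihood methods; the two-point calculation is shown only to make the distance measure concrete.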

RH mapping has been used to generate a high-resolution whole genome radiation hybrid
map of human chromosome 17q22-q25.3 across the genes for growth hormone (GH) and
thymidine kinase (TK) (Foster et al., Genomics 33:185-192, 1996), the region surrounding the
Gorlin syndrome gene (Obermayr et al., Eur. J. Hum. Genet. 4:242-245, 1996), 60 loci covering
the entire short arm of chromosome 12 (Raeymaekers et al., Genomics 29:170-178, 1995), the
region of human chromosome 22 containing the neurofibromatosis type 2 locus (Frazer et al.,
Genomics 14:574-584, 1992) and 13 loci on the long arm of chromosome 5 (Warrington et al.,
Genomics 11:701-708, 1991).

Alternatively, PCR based techniques and human-rodent somatic cell hybrids may be
used to determine the positions of the biallelic markers on the chromosomes. In such
approaches, oligonucleotide primer pairs which are capable of generating amplification products
containing the polymorphic bases of the biallelic markers are designed. Preferably, the
oligonucleotide primers are 18-23 bp in length and are designed for PCR amplification. The
creation of PCR primers from known sequences is well known to those with skill in the art. For
a review of PCR technology see Erlich, H.A., PCR Technology: Principles and Applications for
DNA Amplification. 1992. W.H. Freeman and Co., New York.

The primers are used in polymerase chain reactions (PCR) to amplify templates from
total human genomic DNA. PCR conditions are as follows: 60 ng of genomic DNA is used as a
template for PCR with 80 ng of each oligonucleotide primer, 0.6 unit of Taq polymerase, and 1
mCu of a 32P-labeled deoxycytidine triphosphate. The PCR is performed in a microplate
thermocycler (Techne) under the following conditions: 30 cycles of 94°C, 1.4 min; 55°C, 2 min;
and 72°C, 2 min; with a final extension at 72°C for 10 min. The amplified products are analyzed
on a 6% polyacrylamide sequencing gel and visualized by autoradiography. If the length of the
resulting PCR product is identical to the length expected for an amplification product containing
the polymorphic base of the biallelic marker, then the PCR reaction is repeated with DNA
templates from two panels of human-rodent somatic cell hybrids, BIOS PCRable DNA (BIOS
Corporation) and NIGMS Human-Rodent Somatic Cell Hybrid Mapping Panel Number 1
(NIGMS, Camden, NJ).

PCR is used to screen a series of somatic cell hybrid cell lines containing defined sets of
human chromosomes for the presence of a given biallelic marker. DNA is isolated from the
somatic hybrids and used as starting templates for PCR reactions using the primer pairs from the
biallelic marker. Only those somatic cell hybrids with chromosomes containing the human
sequence corresponding to the biallelic marker will yield an amplified fragment. The biallelic
markers are assigned to a chromosome by analysis of the segregation pattern of PCR products
from the somatic hybrid DNA templates. The single human chromosome present in all cell
hybrids that give rise to an amplified fragment is the chromosome containing that biallelic
marker. For a review of techniques and analysis of results from somatic cell gene mapping
experiments, see Ledbetter et al., Genomics 6:475-481 (1990).
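
The chromosome assignment described above can be sketched as a set intersection over the hybrids that yield an amplified fragment. The sketch is illustrative only. The additional step of excluding chromosomes present in PCR-negative hybrids is an assumption added here and is not recited in the text; the function name, the hybrid identifiers and the chromosome contents are hypothetical.

```python
def assign_chromosome(hybrid_content, pcr_positive):
    """Return the candidate chromosomes for a marker.

    hybrid_content maps each hybrid to the set of human chromosomes it carries;
    pcr_positive maps each hybrid to True if an amplified fragment was obtained.
    """
    positives = [set(c) for h, c in hybrid_content.items() if pcr_positive[h]]
    negatives = [set(c) for h, c in hybrid_content.items() if not pcr_positive[h]]
    if not positives:
        return set()
    # Chromosomes present in every hybrid that gave an amplified fragment.
    candidates = set.intersection(*positives)
    # Assumption: a chromosome carried by a negative hybrid is ruled out.
    for chromosomes in negatives:
        candidates -= chromosomes
    return candidates


# Hypothetical four-hybrid panel and PCR results for one marker.
panel = {
    "H1": {"1", "5", "17"},
    "H2": {"5", "12"},
    "H3": {"3", "17"},
    "H4": {"5", "17", "X"},
}
results = {"H1": True, "H2": False, "H3": True, "H4": True}
print(assign_chromosome(panel, results))
```

An empty or multi-element result would indicate that the panel cannot resolve the marker to a single chromosome, or that a PCR call is inconsistent.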

Example 2 describes a preferred method for positioning of biallelic markers on
clones, such as BAC clones, obtained from genomic DNA libraries. Using such procedures, a
number of BAC clones carrying selected biallelic markers can be isolated. The position of
these BAC clones on the human genome can be defined by performing STS screening as
described in Example 1. Preferably, to decrease the number of STSs to be tested, each BAC
can be localized on chromosomal or subchromosomal regions by procedures such as those
described in Examples 3 and 4. This localization will allow the selection of a subset of STSs
corresponding to the identified chromosomal or subchromosomal region. Testing each BAC
with such a subset of STSs and taking account of the position and order of the STSs along the
genome will allow a refined positioning of the corresponding biallelic marker along the
genome.

In other embodiments, if the DNA library used to isolate BAC inserts or any type of
genomic DNA fragments harboring the selected biallelic markers already constitutes a physical
map of the genome or any portion thereof, using the known order of the DNA fragments will
allow the order of the biallelic markers to be established.

As discussed above, it will be appreciated that markers carried by the same fragment of
genomic DNA, such as the insert in a BAC clone, need not necessarily be ordered with respect to
one another within the genomic fragment to conduct single point or haplotype association
analyses. However, in other embodiments of the present maps, the order of biallelic markers
carried by the same fragment of genomic DNA may be determined.

The positions of the biallelic markers used to construct the maps of the present
invention, including the map-related biallelic markers of the invention, may be assigned to
subchromosomal locations using Fluorescence In Situ Hybridization (FISH) (Cherif et al.,
Proc. Natl. Acad. Sci. USA, 87:6639-6643 (1990)). FISH analysis is described in Example 3.

The ordering analyses may be conducted to generate an integrated genome wide
genetic map comprising about 20,000, 40,000, 60,000, 80,000, 100,000, or 120,000 biallelic
markers with a roughly consistent number of biallelic markers per BAC. In some
embodiments, the map includes one or more markers selected from the group consisting of the
sequences of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences
complementary thereto.

Alternatively, maps having the above-specified average numbers of biallelic markers
per BAC which comprise smaller portions of the genome, such as a set of chromosomes, a
single chromosome, a particular subchromosomal region, or any other desired portion of the
genome, may also be constructed using the procedures provided herein.

In some embodiments, the biallelic markers in the map are separated from one another
by an average distance of 10-200kb, 15-150kb, 20-100kb, 100-150kb, 50-100kb, or 25-50kb.
Maps having the above-specified intermarker distances which comprise smaller portions of the
genome, such as a set of chromosomes, a single chromosome, a particular subchromosomal
region, or any other desired portion of the genome, may also be constructed using the
procedures provided herein.

Figure 2, showing the results of computer simulations of the distribution of
inter-marker spacing on a randomly distributed set of biallelic markers, indicates the percentage of
biallelic markers which will be spaced a given distance apart for a given number of
markers/BAC in the genomic map (assuming 20,000 BACs constituting a minimally
overlapping array covering the entire genome are evaluated). One hundred iterations were
performed for each simulation (20,000 marker map, 40,000 marker map, 60,000 marker map,
120,000 marker map).

As illustrated in Figure 2a, 98% of inter-marker distances will be lower than 150kb
provided 60,000 evenly distributed markers are generated (3 per BAC); 90% of inter-marker
distances will be lower than 150kb provided 40,000 evenly distributed markers are generated
(2 per BAC); and 50% of inter-marker distances will be lower than 150kb provided 20,000
evenly distributed markers are generated (1 per BAC).

As illustrated in Figure 2b, 98% of inter-marker distances will be lower than 80kb
provided 120,000 evenly distributed markers are generated (6 per BAC); 80% of inter-marker
distances will be lower than 80kb provided 60,000 evenly distributed markers are generated (3
per BAC); and 15% of inter-marker distances will be lower than 80kb provided 20,000 evenly
distributed markers are generated (1 per BAC).
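
A simulation of inter-marker spacing of the general kind referred to above may be sketched as follows. The sketch makes simplifying assumptions that differ from the simulations of Figure 2: markers are placed uniformly at random on a single linear genome of 3 x 10^9 base pairs, without regard to BACs or chromosomes, and only one iteration is run per map size. It is therefore not expected to reproduce the percentages recited for Figures 2a and 2b. The function name and the seed are hypothetical.

```python
import random


def fraction_below(n_markers, threshold_bp, genome_bp=3_000_000_000, seed=0):
    """Fraction of adjacent inter-marker gaps shorter than threshold_bp.

    Assumption: markers fall uniformly at random on one linear genome.
    """
    rng = random.Random(seed)
    positions = sorted(rng.randrange(genome_bp) for _ in range(n_markers))
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return sum(gap < threshold_bp for gap in gaps) / len(gaps)


# One run per map size; denser maps leave fewer large gaps.
for n in (20_000, 40_000, 60_000, 120_000):
    print(n, round(fraction_below(n, 150_000), 3))
```

Repeating each run with different seeds and averaging, as in the one hundred iterations mentioned above, would reduce the sampling noise of the estimate.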

As already mentioned, high density biallelic marker maps allow association studies to 
be performed to identify genes involved in complex traits. 
Linkage Disequilibrium

The present invention then also concerns biallelic markers in linkage disequilibrium
with the specific biallelic markers described above and which are expected to present similar
characteristics in terms of their respective association with a given trait. In a preferred
embodiment, the present invention concerns the biallelic markers that are in linkage
disequilibrium with the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374,
3735 to 3908 or the sequences complementary thereto.

LD among a set of biallelic markers having an adequate heterozygosity rate can be
determined by genotyping between 50 and 1000 unrelated individuals, preferably between 75
and 200, more preferably around 100. Genotyping a biallelic marker consists of determining
the specific allele carried by an individual at the given polymorphic base of the biallelic
marker. Genotyping can be performed using similar methods as those described above for the
generation of the biallelic markers, or using other genotyping methods such as those further
described below.
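
Once haplotype counts for a pair of biallelic markers are available, commonly used measures of linkage disequilibrium can be computed as in the following sketch. The sketch assumes that the four haplotype counts are already known or have been estimated; genotypes of unrelated individuals are unphased, so in practice a haplotype frequency estimation step, not shown here, would be needed first. The measures D, D' and r^2 are standard in the literature but are not specified in the text above, and the function name is hypothetical.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) from counts of the four two-marker haplotypes.

    A/a are the alleles of the first marker, B/b those of the second.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes counted")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    d = p_AB - p_A * p_B
    # Largest |D| attainable for these allele frequencies, given the sign of D.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Hypothetical counts: complete disequilibrium, then no disequilibrium.
print(ld_stats(n_AB=50, n_Ab=0, n_aB=0, n_ab=50))
print(ld_stats(n_AB=25, n_Ab=25, n_aB=25, n_ab=25))
```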

Genome-wide linkage disequilibrium mapping aims at identifying, for any
trait-causing allele being searched, at least one biallelic marker in linkage disequilibrium with said
trait-causing allele. Preferably, in order to enhance the power of linkage disequilibrium maps,
in some embodiments, the biallelic markers therein have average inter-marker distances of
150kb or less, 75kb or less, 50kb or less, 30kb or less, or 25kb or less to accommodate the
fact that, in some regions of the genome, the detection of linkage disequilibrium requires
lower inter-marker distances.

The present invention provides methods to generate biallelic marker maps with
average inter-marker distances of 150kb or less. In some embodiments, the mean distance
between biallelic markers constituting the high density map will be less than 75kb, preferably
less than 50kb. Further preferred maps according to the present invention contain markers that
are less than 37.5kb apart. In highly preferred embodiments, the average inter-marker spacing
for the biallelic markers constituting very high density maps is less than 30kb, most preferably
less than 25kb.

Genetic maps containing biallelic markers (including the biallelic markers of SEQ ID
Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary
thereto) may be used to identify and isolate genes associated with detectable traits. The use of
the genetic maps of the present invention is described in more detail below.
VIII. Use of High Density Biallelic Marker Maps to Identify Genes Associated with
Detectable Traits

One embodiment of the present invention comprises methods for identifying and
isolating genes associated with a detectable trait using the biallelic marker maps of the present
invention.

In the past, the identification of genes linked with detectable traits has relied on a
statistical approach called linkage analysis. Linkage analysis is based upon establishing a
correlation between the transmission of genetic markers and that of a specific trait throughout
generations within a family. In this approach, all members of a series of affected families are
genotyped with a few hundred markers, typically microsatellite markers, which are distributed
at an average density of one every 10 Mb. By comparing genotypes in all family members,
one can attribute sets of alleles to parental haploid genomes (haplotyping or phase
determination). The origin of recombined fragments is then determined in the offspring of all
families. Those that co-segregate with the trait are tracked. After pooling data from all
families, statistical methods are used to determine the likelihood that the marker and the trait
are segregating independently in all families. As a result of the statistical analysis, one or
several regions having a high probability of harboring a gene linked to the trait are selected as
candidates for further analysis. The result of linkage analysis is considered as significant (i.e.
there is a high probability that the region contains a gene involved in a detectable trait) when
the chance of independent segregation of the marker and the trait is lower than 1 in 1000
(expressed as a LOD score > 3). Generally, the length of the candidate region identified using
linkage analysis is between 2 and 20Mb.
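
The relationship between the LOD score and the odds of linkage can be made concrete with the following sketch. It assumes the simplest textbook case of phase-known, fully informative meioses, in which the LOD score at recombination fraction theta is the base-10 logarithm of the likelihood at theta divided by the likelihood under independent segregation (theta = 0.5). A LOD score of 3 thus corresponds to odds of 1000 to 1. The function name and the counts used are hypothetical.

```python
import math


def lod_score(n_meioses, n_recombinants, theta):
    """Two-point LOD score for phase-known meioses at recombination fraction theta."""
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must be greater than 0 and at most 0.5")
    if not 0 <= n_recombinants <= n_meioses:
        raise ValueError("recombinant count out of range")
    non_recombinants = n_meioses - n_recombinants
    log_linked = (non_recombinants * math.log10(1.0 - theta)
                  + n_recombinants * math.log10(theta))
    log_unlinked = n_meioses * math.log10(0.5)
    return log_linked - log_unlinked


# Hypothetical pooled data: 2 recombinants among 20 informative meioses.
print(lod_score(20, 2, 0.1))
```

In practice the LOD score is maximized over theta and computed by pedigree likelihood programs that handle unknown phase and incomplete penetrance.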

Once a candidate region is identified as described above, analysis of recombinant
individuals using additional markers allows further delineation of the candidate linked region.

Linkage analysis studies have generally relied on the use of a maximum of 5,000
microsatellite markers, thus limiting the maximum theoretical attainable resolution of linkage
analysis to ca. 600 kb on average.
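
The figure of ca. 600 kb follows from dividing the genome length by the number of markers. The following sketch reproduces that arithmetic using the 3 x 10^9 base pair genome size given earlier in this section; the function names are hypothetical.

```python
import math

GENOME_BP = 3_000_000_000  # haploid genome size used in this section


def mean_spacing_bp(n_markers, genome_bp=GENOME_BP):
    """Average distance between markers spread over the whole genome."""
    return genome_bp / n_markers


def markers_needed(spacing_bp, genome_bp=GENOME_BP):
    """Smallest number of markers giving at most the requested mean spacing."""
    return math.ceil(genome_bp / spacing_bp)


print(mean_spacing_bp(5_000))       # 5,000 microsatellite markers
print(markers_needed(10_000_000))   # one marker every 10 Mb
print(markers_needed(25_000))       # a very high density biallelic map
```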

Linkage analysis has been successfully applied to map simple genetic traits that show
clear Mendelian inheritance patterns and which have a high penetrance (penetrance is the ratio
between the number of trait-positive carriers of allele a and the total number of a carriers in
the population). About 100 pathological trait-causing genes were discovered using linkage
analysis over the last 10 years. In most of these cases, the majority of affected individuals had
affected relatives and the detectable trait was rare in the general population (frequencies less
than 0.1%). In about 10 cases, such as Alzheimer's Disease, breast cancer, and Type II
diabetes, the detectable trait was more common but the allele associated with the detectable
trait was rare in the affected population. Thus, the alleles associated with these traits were not
responsible for the trait in all sporadic cases.

Linkage analysis suffers from a variety of drawbacks. First, linkage analysis is
limited by its reliance on the choice of a genetic model suitable for each studied trait.
Furthermore, as already mentioned, the resolution attainable using linkage analysis is limited,
and complementary studies are required to refine the analysis of the typical 2Mb to 20Mb
regions initially identified through linkage analysis.

In addition, linkage analysis approaches have proven difficult when applied to
complex genetic traits, such as those due to the combined action of multiple genes and/or
environmental factors. In such cases, too large an effort and cost are needed to recruit the
adequate number of affected families required for applying linkage analysis to these
situations, as recently discussed by Risch, N. and Merikangas, K. (Science 273:1516-1517
(1996)).

Finally, linkage analysis cannot be applied to the study of traits for which no large
informative families are available. Typically, this will be the case in any attempt to identify
trait-causing alleles involved in sporadic cases, such as alleles associated with positive or
negative responses to drug treatment.

The present genetic maps and biallelic markers (including the biallelic markers of
SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences
complementary thereto) may be used to identify and isolate genes associated with detectable
traits using association studies, an approach which does not require the use of affected
families and which permits the identification of genes associated with sporadic traits.
Association Studies 

As already mentioned, any gene responsible or partly responsible for a given trait will
be in linkage disequilibrium with some flanking markers. To map such a gene, specific alleles
of these flanking markers which are associated with the gene or genes responsible for the trait
are identified. Although the following discussion of techniques for finding the gene or genes
associated with a particular trait using linkage disequilibrium mapping refers to locating a
single gene which is responsible for the trait, it will be appreciated that the same techniques
may also be used to identify genes which are partially responsible for the trait.

Association studies may be conducted within the general population (as opposed to
the linkage analysis techniques discussed above which are limited to studies performed on
related individuals in one or several affected families).

Association between a biallelic marker A and a trait T may primarily occur as a result
of three possible relationships between the biallelic marker and the trait.

First, allele a of biallelic marker A may be directly responsible for trait T (e.g., Apo E
ε4 site A and Alzheimer's disease). However, since the majority of the biallelic markers
used in genetic mapping studies are selected randomly, they mainly map outside of genes.
Thus, the likelihood of allele a being a functional mutation directly related to trait T is very
low.

Second, an association between a biallelic marker A and a trait T may also occur
when the biallelic marker is very closely linked to the trait locus. In other words, an
association occurs when allele a is in linkage disequilibrium with the trait-causing allele.
When the biallelic marker is in close proximity to a gene responsible for the trait, more
extensive genetic mapping will ultimately allow a gene to be discovered near the marker locus
which carries mutations in people with trait T (i.e. the gene responsible for the trait or one of
the genes responsible for the trait). As will be further exemplified below, using a group of
biallelic markers which are in close proximity to the gene responsible for the trait, the location
of the causal gene can be deduced from the profile of the association curve between the
biallelic markers and the trait. The causal gene will usually be found in the vicinity of the
marker showing the highest association with the trait.

Finally, an association between a biallelic marker and a trait may occur when people
with the trait and people without the trait correspond to genetically different subsets of the
population who, coincidentally, also differ in the frequency of allele a (population
stratification). This phenomenon may be avoided by using ethnically matched large
heterogeneous samples.
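
An association between an allele and a trait is typically assessed by comparing allele counts in trait-positive and control individuals. The following sketch shows one common way of doing so, a Pearson chi-square test on a 2 x 2 table of allele counts with one degree of freedom. The choice of test is an assumption made for illustration and is not dictated by the text above; the function name and the counts are hypothetical.

```python
import math


def allele_association(case_a, case_other, ctrl_a, ctrl_other):
    """Return (chi-square, p-value) for a 2x2 table of allele counts."""
    table = [[case_a, case_other], [ctrl_a, ctrl_other]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + ctrl_a, case_other + ctrl_other]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                raise ValueError("a row or column of the table is empty")
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts of allele a and the other allele in 50 cases and 50 controls.
print(allele_association(case_a=60, case_other=40, ctrl_a=40, ctrl_other=60))
```

Applying such a test to each marker in a region and plotting the results along the map gives the kind of association profile from which the position of a causal gene can be inferred. A significant result does not by itself exclude population stratification.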

Association studies are particularly suited to the efficient identification of genes that
present common polymorphisms, and are involved in multifactorial traits whose frequency is
relatively higher than that of diseases with monofactorial inheritance.

Association studies mainly consist of four steps: recruitment of trait-positive (T+) and
control populations, preferably trait-negative (T-) populations with well-defined phenotypes;
identification of a candidate region suspected of harboring a trait-causing gene; identification
of said gene among candidate genes in the region; and finally validation of mutation(s)
responsible for the trait in said trait-causing gene.

In a first step, the trait-positive phenotype should be well-defined; preferably, the control
phenotype is a well-defined trait-negative phenotype as well. In order to perform efficient and
significant association studies such as those described herein, the trait under study should
preferably follow a bimodal distribution in the population under study, presenting two clear
non-overlapping phenotypes, trait-positive and trait-negative.

Nevertheless, in the absence of such a bimodal distribution (as may in fact be the case
for complex genetic traits), any genetic trait may still be analyzed using the association
method proposed herein by carefully selecting the individuals to be included in the
trait-positive group and preferably the trait-negative phenotypic group as well. The selection
procedure ideally involves selecting individuals at opposite ends of the non-bimodal
phenotype spectrum of the trait under study, so as to include in these trait-positive and
trait-negative populations individuals who clearly represent non-overlapping, preferably extreme
phenotypes.

As discussed above, the definition of the inclusion criteria for the trait-positive and 
control populations is an important aspect of the present invention. 

Figure 3 shows, for a series of hypothetical sample sizes, the p-value significance
obtained in association studies performed using individual markers from the high-density
biallelic map, according to various hypotheses regarding the difference of allelic frequencies
between the trait-positive and trait-negative samples. It indicates that, in all cases, samples
ranging from 150 to 500 individuals are numerous enough to achieve statistical significance.
It will be appreciated that bigger or smaller groups can be used to perform association studies
according to the methods of the present invention.

In a second step, a marker/trait association study is performed that compares the
genotype frequency of each biallelic marker in the above described trait-positive and
trait-negative populations by means of a chi square statistical test (one degree of freedom). In
addition to this single marker association analysis, a haplotype association analysis is
performed to define the frequency and the type of the ancestral carrier haplotype. Haplotype
analysis, by combining the informativeness of a set of biallelic markers, increases the power of
the association analysis, allowing false positive and/or negative data that may result from the
single marker studies to be eliminated.
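
The single marker comparison described above can be sketched as follows. This is a minimal
illustration only, assuming that the one-degree-of-freedom test is applied to a 2x2 table of
allele counts (two alleles, two populations); the function name and the counts are hypothetical
and are not taken from the Examples.

```python
import math


def allele_chi_square(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 degree of freedom) on a 2x2 table of allele counts.

    Rows are trait-positive and trait-negative chromosomes; columns are the
    two alleles of one biallelic marker. Returns (statistic, p_value).
    """
    table = [[case_a, case_b], [ctrl_a, ctrl_b]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_a + ctrl_a, case_b + ctrl_b]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("each row and each column needs a non-zero total")
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: 200 individuals (400 chromosomes) per group, with
# allele a at 30% in trait-positive and 20% in trait-negative samples.
stat, p_value = allele_chi_square(120, 280, 80, 320)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

The same function can be applied marker by marker across a map; whether a given p-value is
treated as significant depends on the threshold chosen for the study.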

Genotyping can be performed using any method described in III, including the
microsequencing procedure described in Example 8.

If a positive association with a trait is identified using an array of biallelic markers
having a high enough density, the causal gene will be physically located in the vicinity of the
associated markers, since the markers showing positive association with the trait are in linkage
disequilibrium with the trait locus. Regions harboring a gene responsible for a particular trait
which are identified through association studies using high density sets of biallelic markers
will, on average, be 20 - 40 times shorter in length than those identified by linkage analysis.

Once a positive association is confirmed as described above, a third step consists of
completely sequencing the BAC inserts harboring the markers identified in the association
analyses. These BACs are obtained through screening human genomic libraries with the
marker probes and/or primers, as described above. Once a candidate region has been
sequenced and analyzed, the functional sequences within the candidate region (e.g. exons,
splice sites, promoters, and other potential regulatory regions) are scanned for mutations
which are responsible for the trait by comparing the sequences of the functional regions in a
selected number of trait-positive and trait-negative individuals using appropriate software.
Tools for sequence analysis are further described in Example 9.

Finally, candidate mutations are then validated by screening a larger population of
trait-positive and trait-negative individuals using genotyping techniques described below.
Polymorphisms are confirmed as candidate mutations when the validation population shows
association results compatible with those found between the mutation and the trait in the test
population.

In practice, in order to define a region bearing a candidate gene, the trait-positive and
trait-negative populations are genotyped using an appropriate number of biallelic markers.
The markers may include one or more of the markers of SEQ ID Nos: 1 to 3908, 1 to 2260,
2261 to 3734, 3735 to 3908 or the sequences complementary thereto.

The markers used to define a region bearing a candidate gene may be distributed at an
average density of 1 marker per 10-200 kb. Preferably, the markers used to define a region
bearing a candidate gene are distributed at an average density of 1 marker every 15-150 kb. In
further preferred embodiments, the markers used to define a region bearing a candidate gene
are distributed at an average density of 1 marker every 20-100kb. In yet another preferred
embodiment, the markers used to define a region bearing a candidate gene are distributed at an
average density of 1 marker every 100 to 150kb. In a further highly preferred embodiment,
the markers used to define a region bearing a candidate gene are distributed at an average
density of 1 marker every 50 to 100kb. In yet another embodiment, the biallelic markers used
to define a region bearing a candidate gene are distributed at an average density of 1 marker
every 25-50 kilobases. As mentioned above, in order to enhance the power of linkage
disequilibrium based maps, in a preferred embodiment, the marker density of the map will be
adapted to take the linkage disequilibrium distribution in the genomic region of interest into
account.

In some embodiments, the initial identification of a candidate genomic region
harboring a gene associated with a detectable phenotype may be conducted using a
preliminary map containing a few thousand biallelic markers. Thereafter, the genomic region
harboring the gene responsible for the detectable trait may be better delineated using a map
containing a larger number of biallelic markers. Furthermore, the genomic region harboring
the gene responsible for the detectable trait may be further delineated using a high density
map of biallelic markers. Finally, the gene associated with the detectable trait may be
identified and isolated using a very high density biallelic marker map.

Example 6 describes a procedure for identifying a candidate region harboring a gene
associated with a detectable trait and provides simulated results for this procedure. It will be
appreciated that although Example 6 compares the results of simulated analyses using markers
derived from maps having 3,000, 20,000, and 60,000 markers, the number of markers
contained in the map is not restricted to these exemplary figures. Rather, Example 6
exemplifies the increasing refinement of the candidate region with increasing marker density.
As increasing numbers of markers are used in the analysis, points in the association analysis
become broad peaks. The gene associated with the detectable trait under investigation will lie
within or near the region under the peak.

The statistical power of linkage disequilibrium mapping using a high density marker
map is also reinforced by complementing the single point association analysis described above
with a multi-marker association analysis or haplotype analysis described in IV. To improve the
statistical power of the individual marker association analyses conducted as described above
using maps of increasing marker densities, haplotype studies can be performed using groups of
markers located in proximity to one another within regions of the genome. For example, using
the methods described above in which the association of an individual marker with a
detectable phenotype was analyzed using maps of 3,000 markers, 20,000 markers, and 60,000
markers, a series of haplotype studies can be performed using groups of contiguous markers
from such maps or from maps having higher marker densities.

In a preferred embodiment, a series of successive haplotype studies including groups
of markers spanning regions of more than 1 Mb may be performed. In some embodiments, the
biallelic markers included in each of these groups may be located within a genomic region
spanning less than 1kb, from 1 to 5kb, from 5 to 10kb, from 10 to 25kb, from 25 to 50kb, from
50 to 150kb, from 150 to 250kb, from 250 to 500kb, from 500kb to 1Mb, or more than 1Mb.
Preferably, the genomic regions containing the groups of biallelic markers used in the
successive haplotype analyses are overlapping. It will be appreciated that the groups of
biallelic markers need not completely cover the genomic regions of the above-specified lengths
but may instead be obtained from incomplete contigs having one or more gaps therein. As
discussed in further detail below, biallelic markers may be used in single point and haplotype
association analyses regardless of the completeness of the corresponding physical contig
harboring them.
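
The grouping of contiguous markers into overlapping regions for successive haplotype
studies can be illustrated with a sliding window. The sketch below is one possible reading of
that step, not a procedure taken from the Examples; the function name, the window size of
three markers, and the marker positions are all hypothetical.

```python
def overlapping_marker_groups(positions_kb, group_size=3, step=1):
    """Return groups of contiguous markers taken as a sliding window.

    positions_kb: marker positions along a region, in kb.
    With step < group_size, successive groups share markers, so the
    genomic regions they span overlap.
    """
    if group_size < 1 or step < 1:
        raise ValueError("group_size and step must be positive")
    ordered = sorted(positions_kb)
    groups = []
    for start in range(0, len(ordered) - group_size + 1, step):
        window = ordered[start:start + group_size]
        groups.append({"markers": window, "span_kb": window[-1] - window[0]})
    return groups


# Hypothetical marker positions (kb); gaps in the underlying contig only
# change the spacing, not the grouping logic.
for group in overlapping_marker_groups([0, 40, 90, 150, 200]):
    print(group["markers"], group["span_kb"])
```

Each returned group can then be submitted to a haplotype analysis, and the span of each group
can be checked against the region lengths listed above.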

Genome-wide mapping using association studies with dense enough arrays of markers
permits a case-by-case best estimate of p-value significance thresholds. Given a test population
comprising two ethnically matched trait-positive and control groups of about 50 to about 500
individuals or more, conducting the above described association studies will allow a p-value
"cut-off" to be established by, for example, analyzing significant numbers of allele frequency
differences or, in some cases where appropriate, running computer simulations or control
studies as described in Examples 6, 15, and 26.

For a p-value above the threshold, a corresponding association between the trait and a 
studied marker will be deemed not significant, while for a p-value below such a threshold, said 
association will be deemed significant. If the p-value is significant, the genomic region 
around the marker will be further scrutinized for a trait-causing gene. 

It is preferred that p-value significance thresholds be assessed for each case/control
population comparison. Both the genetic distance between sampled populations
("stratification") and the dispersion due to random selection of samples may indeed influence
the p-value significance thresholds.

It will be appreciated that the above approaches may be conducted on any scale (i.e.
over the whole genome, a set of chromosomes, a single chromosome, a particular
subchromosomal region, or any other desired portion of the genome). As mentioned above,
once significance thresholds have been assessed, population sample sizes may be adapted as
exemplified in Figure 3.

Example 7 below illustrates the increase in statistical power brought to an association
study by a haplotype analysis.

The results described in Examples 5 and 7, generated from individual and haplotype
studies using a biallelic marker set with an average inter-marker spacing of ca. 40 kb in the
region of an Alzheimer's disease trait causing gene, indicate that all biallelic markers of sufficient
informative content located within a ca. 200 kb genomic region around a trait-causing allele
can potentially be successfully used to localize a trait causing gene with the methods provided
by the present invention. This conclusion is further supported by the results obtained through
measuring the linkage disequilibrium between markers 99-365-344 or 99-359-308 and the ApoE ε4
Site A marker within Alzheimer's patients: as one could predict, since linkage disequilibrium
is the supporting basis for association studies, linkage disequilibrium between these pairs of
markers was enhanced in the diseased population vs. the control population, in a similar way
as the haplotype analysis enhanced the significance of the corresponding association studies.
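
A comparison of linkage disequilibrium between two biallelic markers in a diseased and a
control population can be sketched as follows. The measures used here (D, |D'| and r squared)
are standard textbook definitions and are an assumption of this sketch; the measure actually
used in the Examples may differ. The haplotype frequencies shown are hypothetical.

```python
def ld_measures(f_AB, f_Ab, f_aB, f_ab):
    """Pairwise linkage disequilibrium from four haplotype frequencies.

    Returns (D, abs_D_prime, r_squared) for two biallelic markers with
    alleles A/a and B/b.
    """
    total = f_AB + f_Ab + f_aB + f_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    p_A = f_AB + f_Ab
    p_B = f_AB + f_aB
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both markers must be polymorphic")
    d = f_AB - p_A * p_B
    # D is normalized by its largest attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return d, abs(d) / d_max, d * d / denom


# Hypothetical haplotype frequencies in each population.
cases = ld_measures(0.30, 0.10, 0.10, 0.50)
controls = ld_measures(0.22, 0.18, 0.18, 0.42)
print("cases:    D=%.3f |D'|=%.3f r2=%.3f" % cases)
print("controls: D=%.3f |D'|=%.3f r2=%.3f" % controls)
```

With inputs of this kind, a larger |D'| or r squared in the diseased population than in the
control population would correspond to the enhancement described above.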

Once a given polymorphic site has been found and characterized as a biallelic marker
according to the methods of the present invention, several methods can be used in order to
determine the specific allele carried by an individual at the given polymorphic base as
described in III.

Identification of a Gene Associated with Detectable Traits

Once the candidate region has been delineated using the high density biallelic marker
map, a sequence analysis process will allow the detection of all genes located within said
region, together with a potential functional characterization of said genes. The identified
functional features may allow preferred trait-causing candidates to be chosen from among the
identified genes. More biallelic markers may then be generated within said candidate genes,
and used to perform refined association studies that will support the identification of the trait
causing gene. Sequence analysis processes are described in Example 9.

Examples 10-18 illustrate the application of the above methods using biallelic markers
to identify a gene associated with a complex disease, prostate cancer, within a ca. 450 kb
candidate region. Additional details of the identification of the gene associated with prostate
cancer are provided in the U.S. Patent Application entitled "Prostate Cancer Gene" Serial No.
08/996,306.

The above methods were also used to identify biallelic markers in a gene which was
an attractive candidate for a gene associated with asthma. Examples 19-26 show how the use
of methods of the present invention allowed this gene to be identified as a gene responsible, at
least partially, for asthma in the studied populations. Additional details of the identification of
the gene associated with asthma are provided in U.S. Provisional Application Serial No.
60/081,893.

Alternatively, genes associated with detectable traits may be identified as follows.
Candidate genomic regions suspected of harboring a gene associated with the trait may be
identified using techniques such as those described herein. In such techniques, the allelic
frequencies of biallelic markers are compared in nucleic acid samples derived from
individuals expressing the detectable trait and individuals who do not express the detectable
trait. In this manner, candidate genomic regions suspected of harboring a gene associated with
the detectable trait under investigation are identified.

The existence of one or more genes associated with the detectable trait within the
candidate region is confirmed by identifying more biallelic markers lying in the candidate
region. A first haplotype analysis is performed for each possible combination of groups of
biallelic markers within the genomic region suspected of harboring a trait-associated gene.
For example, each group may comprise three biallelic markers. For each of the groups of
markers, the frequency of each possible haplotype (for groups of three markers there are 8
possible haplotypes) in individuals expressing the trait and individuals who do not express the
trait is estimated. For example, a haplotype estimation method is applied as described in
IV; for instance, the haplotype frequencies may be estimated using the Expectation-Maximization
method of Excoffier L and Slatkin M, Mol. Biol. Evol. 12:921-927 (1995).

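
The estimation of haplotype frequencies from unphased genotypes can be sketched with a
simplified expectation-maximization loop. This is an illustrative sketch in the spirit of an
EM approach, not a reproduction of the cited method or of the procedure described in IV; the
genotype encoding (0, 1 or 2 copies of one allele per marker), the function names and the
example data are assumptions made for this illustration.

```python
from itertools import product


def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype[i] is the number of copies (0, 1 or 2) of allele '1' at marker i.
    """
    het_sites = [i for i, g in enumerate(genotype) if g == 1]
    base = [g // 2 for g in genotype]  # homozygous sites: 0 -> 0, 2 -> 1
    free_sites = het_sites[1:]  # phase of the first het site is fixed to avoid duplicates
    pairs = []
    for phases in product((0, 1), repeat=len(free_sites)):
        h1, h2 = list(base), list(base)
        if het_sites:
            h1[het_sites[0]], h2[het_sites[0]] = 1, 0
        for site, phase in zip(free_sites, phases):
            h1[site], h2[site] = phase, 1 - phase
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, n_markers, iterations=200, tol=1e-10):
    """Estimate haplotype frequencies; for 3 markers there are 2**3 = 8 haplotypes."""
    haplotypes = list(product((0, 1), repeat=n_markers))
    freqs = {h: 1.0 / len(haplotypes) for h in haplotypes}
    pair_lists = [compatible_pairs(g) for g in genotypes]
    for _ in range(iterations):
        counts = dict.fromkeys(haplotypes, 0.0)
        # E step: share each individual among its compatible haplotype pairs.
        for pairs in pair_lists:
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), weight in zip(pairs, weights):
                counts[h1] += weight / total
                counts[h2] += weight / total
        # M step: expected haplotype counts divided by the number of chromosomes.
        new_freqs = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        change = max(abs(new_freqs[h] - freqs[h]) for h in haplotypes)
        freqs = new_freqs
        if change < tol:
            break
    return freqs


# Hypothetical genotypes at three markers for five individuals.
sample = [(0, 0, 0), (2, 2, 2), (1, 1, 1), (0, 0, 0), (1, 1, 0)]
for haplotype, freq in sorted(em_haplotype_frequencies(sample, 3).items()):
    print(haplotype, round(freq, 3))
```

In a case/control setting the estimation would be run separately on the trait-positive and
trait-negative samples, giving two sets of frequencies to compare.
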
The frequencies of each of the possible haplotypes of the grouped markers (or each
allele of individual markers) in individuals expressing the trait and individuals who do not
express the trait are compared. For example, the frequencies may be compared by performing
a chi-squared analysis. Within each group, the haplotype (or the allele of each individual
marker) having the greatest association with the trait is selected. This process is repeated for
each group of biallelic markers (or each allele of the individual markers) to generate a
distribution of association values, which will be referred to herein as the "trait-associated"
distribution.
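
The construction of a distribution of association values can be sketched as follows, assuming
that the association value of a haplotype is a one-degree-of-freedom chi-squared statistic
comparing that haplotype against all other haplotypes of its group pooled together. The input
layout (one dictionary per group, mapping a haplotype label to its counts in trait-positive
and trait-negative samples) and the counts themselves are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]] (shortcut formula)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def association_distribution(groups):
    """For each group of markers, keep the largest haplotype association value.

    groups: list of dicts {haplotype_label: (count_in_cases, count_in_controls)}.
    """
    distribution = []
    for counts in groups:
        case_total = sum(case for case, _ in counts.values())
        ctrl_total = sum(ctrl for _, ctrl in counts.values())
        best = max(
            chi_square_2x2(case, case_total - case, ctrl, ctrl_total - ctrl)
            for case, ctrl in counts.values()
        )
        distribution.append(best)
    return distribution


# Hypothetical (estimated) haplotype counts for two groups of markers.
groups = [
    {"AAA": (60, 40), "BBB": (40, 60)},
    {"AAA": (50, 50), "BBB": (50, 50)},
]
print(association_distribution(groups))
```

Applying the same function to groups of markers taken from regions not suspected of harboring
a trait-associated gene would give the corresponding reference distribution.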

A second haplotype analysis is performed for each possible combination of groups of
biallelic markers within the genomic regions which are not suspected of harboring a
trait-associated gene. For example, each group may comprise three biallelic markers. For each of
the groups of markers, the frequency of each possible haplotype (for groups of three markers
there are 8 possible haplotypes) in individuals expressing the trait and individuals who do not
express the trait is estimated.

The frequencies of each of the possible haplotypes of the grouped markers (or each
allele of individual markers) in individuals expressing the trait and individuals who do not
express the trait are compared. For example, the frequencies may be compared by performing
a chi-squared analysis. Within each group, the haplotype (or the allele of each individual
marker) having the greatest association with the trait is selected. This process is repeated for
each group of biallelic markers (or each allele of the individual markers) to generate a
distribution of association values, which will be referred to herein as the "random"
distribution.

The trait-associated distribution and the random distribution are then compared to one
another to determine if there are significant differences between them. For example, the
trait-associated distribution and the random distribution can be compared using either the
Wilcoxon rank test (Noether, G.E. (1991) "Introduction to Statistics: The Nonparametric
Way", Springer-Verlag, New York, Berlin) or the Kolmogorov-Smirnov test (Saporta, G.
(1990) "Probabilités, analyse des données et statistiques" Technip editions, Paris) or both the
Wilcoxon rank test and the Kolmogorov-Smirnov test.
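
The comparison of the two distributions can be sketched with the two tests named above. The
implementation below is a minimal illustration using large-sample approximations (a normal
approximation without tie correction for the rank-sum test, and the asymptotic series for the
Kolmogorov-Smirnov test); it is an assumption of this sketch that these approximations are
acceptable, and for small samples exact methods or an established statistics package would be
preferable. The function names and example values are hypothetical.

```python
import math
from bisect import bisect_right


def midranks(values):
    """1-based ranks of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Wilcoxon rank-sum test; returns (z, two_sided_p) by normal approximation."""
    n, m = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    w = sum(ranks[:n])
    mean = n * (n + m + 1) / 2
    sd = math.sqrt(n * m * (n + m + 1) / 12)
    z = (w - mean) / sd
    return z, math.erfc(abs(z) / math.sqrt(2))


def ks_statistic(x, y):
    """Largest distance between the two empirical cumulative distributions."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in set(xs) | set(ys)
    )


def ks_p_value(d, n, m, terms=100):
    """Asymptotic two-sided p-value for a two-sample KS statistic d."""
    if d <= 0:
        return 1.0
    lam = math.sqrt(n * m / (n + m)) * d
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
        for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


# Hypothetical association values for the two distributions.
trait_associated = [6.1, 7.4, 8.0, 9.3, 10.2, 11.5]
random_values = [0.4, 1.1, 1.9, 2.5, 3.0, 4.2]
z, p_rank = rank_sum_test(trait_associated, random_values)
d = ks_statistic(trait_associated, random_values)
p_ks = ks_p_value(d, len(trait_associated), len(random_values))
print(f"rank-sum z = {z:.2f}, p = {p_rank:.4f}; KS D = {d:.2f}, p = {p_ks:.4f}")
```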

If the trait-associated distribution and the random distribution are found to be
significantly different, the candidate genomic region is highly likely to contain a gene
associated with the detectable trait. Accordingly, the candidate genomic region is evaluated
more fully to isolate the trait-associated gene. Alternatively, if the trait-associated distribution
and the random distribution do not differ significantly using the above analyses, the candidate
genomic region is unlikely to contain a gene associated with the detectable trait. Accordingly,
no further analysis of the candidate genomic region is performed.

While Examples 10 to 26 illustrate the use of the maps and markers of the present
invention for identifying a new gene associated with a complex disease within a 2Mb genomic
region and for establishing that a candidate gene is, at least partially, responsible for a disease, the
maps and markers of the present invention may also be used to identify one or more biallelic
markers or one or more genes associated with other detectable phenotypes, including drug
response, drug toxicity, or drug efficacy. The biallelic markers used in such drug response
analyses or shown, using the methods of the present invention, to be associated with such
traits, may lie within or near genes responsible for or partly responsible for a particular
disease, for example a disease against which the drug is meant to act, or may lie within
genomic regions which are not responsible for or partly responsible for a disease. In the
context of the present invention, a "positive response" to a medicament can be defined as
comprising a reduction of the symptoms related to the disease or condition to be treated. In
the context of the present invention, a "negative response" to a medicament can be defined as
comprising either a lack of positive response to the medicament, which does not lead to a
symptom reduction, or a side-effect observed following administration of the medicament.

Drug efficacy, response and tolerance/toxicity can be considered as multifactorial
traits involving a genetic component in the same way as complex diseases such as Alzheimer's
disease, prostate cancer, hypertension or diabetes. As such, the identification of genes
involved in drug efficacy and toxicity could be achieved following a positional cloning
approach, e.g. performing linkage analysis within families in order to obtain the
subchromosomal location of the gene(s). However, this type of analysis is actually impractical
in the case of drug responsiveness, due to the lack of availability of familial cases. In fact, the
likelihood of having more than one individual in a particular family being exposed to the same
drug at the same time is very low. Therefore, drug efficacy and toxicity can only be analyzed
as sporadic traits.

In order to conduct association studies to analyze the individual response to a given
drug in groups of patients affected with a disease, up to four groups are screened to determine
their patterns of biallelic markers using the techniques described above. The four groups are:

- Non-diseased or random controls,
- Diseased patients/drug responders,
- Diseased patients/drug non-responders, and
- Diseased patients/drug side effects.

In preferred embodiments, the above mentioned groups are recruited according to
phenotyping criteria having the characteristics described above, so that the phenotypes
defining the different groups are non-overlapping, preferably extreme phenotypes. In highly
preferred embodiments, such phenotyping criteria have the bimodal distribution described
above.

The final number and composition of the groups for each drug association study is
adapted to the distribution of the above described phenotypes within the studied population.

After selecting a suitable population, association and haplotype analyses may be
performed as described herein to identify one or more biallelic markers associated with drug
response, preferably drug toxicity or drug efficacy. The identification of such one or more
biallelic markers allows one to conduct diagnostic tests to determine whether the
administration of a drug to an individual will result in drug response, preferably drug toxicity,
or drug efficacy.

The methods described above for identifying a gene associated with prostate cancer
and biallelic markers indicative of a risk of suffering from asthma may be utilized to identify
genes associated with other detectable phenotypes. In particular, the above methods may be
used with any marker or combination of markers included in the maps of the present
invention, including the biallelic markers of SEQ ID Nos.: 1 to 3908 or the sequences
complementary thereto. As described above, the general strategy to perform the association
studies using the maps and markers of the present invention is to scan two groups of
individuals (trait-positive individuals and trait-negative controls) characterized by a
well-defined phenotype in order to measure the allele frequencies of the biallelic markers in each of
these groups. Preferably, the frequencies of markers with inter-marker spacing of about 150
kb are determined in each group. More preferably, the frequencies of markers with
inter-marker spacing of about 75 kb are determined in each group. Even more preferably, markers
with inter-marker spacing of about 50 kb, about 37.5kb, about 30kb, or about 25kb will be
tested in each population.

In some embodiments the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, 2000, 3000,
or all of the biallelic markers of SEQ ID Nos.: 1 to 3908 or the sequences complementary
thereto are measured in each population. In another embodiment, the frequencies of 1, 5, 10,
20, 50, 100, 500, 1000, 2000, or 3000 biallelic markers selected from the group consisting of
biallelic markers which are in linkage disequilibrium with the biallelic markers of 1 to 3908 or
the sequences complementary thereto are measured in each population. In some embodiments
the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, 2000, or all of the biallelic markers of SEQ
ID Nos.: 1 to 2260 or the sequences complementary thereto are measured in each population.
In another embodiment, the frequencies of 1, 5, 10, 20, 50, 100, 500, 1000, or 2000 biallelic
markers selected from the group consisting of biallelic markers which are in linkage
disequilibrium with the biallelic markers of 1 to 2260 or the sequences complementary thereto
are measured in each population. In some embodiments the frequencies of 1, 5, 10, 20, 50,
100, 500, 1000, or all of the biallelic markers of SEQ ID Nos.: 2261 to 3734 or the sequences
complementary thereto are measured in each population. In another embodiment, the
frequencies of 1, 5, 10, 20, 50, 100, 500, or 1000 biallelic markers selected from the group
consisting of biallelic markers which are in linkage disequilibrium with the biallelic markers
of 2261 to 3734 or the sequences complementary thereto are measured in each population. In
some embodiments the frequencies of 1, 5, 10, 20, 50, 100, or all of the biallelic markers of
SEQ ID Nos.: 3735 to 3908 or the sequences complementary thereto are measured in each
population. In another embodiment, the frequencies of 1, 5, 10, 20, 50, or 100 biallelic
markers selected from the group consisting of biallelic markers which are in linkage
disequilibrium with the biallelic markers of 3735 to 3908 or the sequences complementary
thereto are measured in each population.

In some embodiments, the frequencies of about 20,000, or about 40,000 biallelic
markers are determined in each population. In a highly preferred embodiment, the frequencies
of about 60,000, about 80,000, about 100,000, or about 120,000 biallelic markers are
determined in each population. In some embodiments, haplotype analyses may be run using
groups of markers located within regions spanning less than 1kb, from 1 to 5kb, from 5 to
10kb, from 10 to 25kb, from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to
500kb, from 500kb to 1Mb, or more than 1Mb.
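
The relationship between the marker counts listed above and the inter-marker spacings quoted
earlier (from about 150 kb down to about 25 kb) can be checked with simple arithmetic. The
sketch below assumes a genome size of about 3,000 Mb and evenly distributed markers; both
are simplifying assumptions of this illustration rather than figures stated in this passage.

```python
GENOME_SIZE_KB = 3_000_000  # assumption: roughly 3,000 Mb for the human genome


def mean_spacing_kb(n_markers, genome_kb=GENOME_SIZE_KB):
    """Average inter-marker spacing, in kb, for evenly distributed markers."""
    if n_markers <= 0:
        raise ValueError("n_markers must be positive")
    return genome_kb / n_markers


for n_markers in (20_000, 40_000, 60_000, 80_000, 100_000, 120_000):
    print(f"{n_markers:>7} markers -> about {mean_spacing_kb(n_markers):g} kb")
```

Under these assumptions, 20,000 markers correspond to an average spacing of about 150 kb and
120,000 markers to about 25 kb; real maps would deviate from these averages wherever marker
density is adapted to the local linkage disequilibrium.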

Allele frequency can be measured using any genotyping method described herein
including microsequencing techniques; preferred high throughput microsequencing procedures
are further exemplified in III; it will be further appreciated that any other large scale
genotyping method suitable for the intended purpose contemplated herein may also be used.

It will be appreciated that it is not necessary to use a full high density biallelic marker
map in order to start a genome-wide association study. Maps having higher densities of
biallelic markers (two or more markers per BAC, average inter-marker spacing of about 75kb
or less) may then be generated by starting first on those BACs for which a candidate
association has been established at the first step.

In cases when one or more candidate regions have previously been delineated, such as
cases where a particular gene or genomic region is suspected of being associated with a trait,
local excerpts of biallelic marker maps having densities above one marker per 150kb may be
exploited using BACs harboring said genomic regions, or genes, or portions thereof. In these
cases also, successive association studies may be performed using sets of biallelic markers
showing increasing densities, preferably from about one every 150 kb to about one every
75kb; more preferably, sets of markers with inter-marker spacing below about 50kb, below
about 37.5kb, below about 30kb, most preferably below about 25 kb, will be used.

Haplotype analyses may also be conducted using groups of biallelic markers within
the candidate region. The biallelic markers included in each of these groups may be located
within a genomic region spanning less than 1kb, from 1 to 5kb, from 5 to 10kb, from 10 to 25kb,
from 25 to 50kb, from 50 to 150kb, from 150 to 250kb, from 250 to 500kb, from 500kb to 1Mb,
or more than 1Mb. It will be appreciated that the ordered DNA fragments containing these
groups of biallelic markers need not completely cover the genomic regions of these lengths but
may instead be incomplete contigs having one or more gaps therein. As discussed in further
detail below, biallelic markers may be used in association studies and haplotype analyses
regardless of the completeness of the corresponding physical contig harboring them, provided
linkage disequilibrium between the markers can be assessed.

As described above, if a positive association with a trait, such as a disease, or a drug
efficacy and/or toxicity, is identified using the biallelic markers and maps of the present
invention, the maps will provide not only the confirmation of the association, but also a
shortcut towards the identification of the gene involved in the trait under study. As described
above, since the markers showing positive association to the trait are in linkage disequilibrium
with the trait loci, the causal gene will be physically located in the vicinity of these markers.
Regions identified through association studies using high density maps will on average have a
20 - 40 times shorter length than those identified by linkage analysis (2 to 20 Mb).

As described above, once a positive association is confirmed with the high density
biallelic marker maps of the present invention, BACs from which the most highly associated
markers were derived are completely sequenced and the mutations in the causal gene are
searched for by applying genomic analysis tools. As described above, once a region harboring a
gene associated with a detectable trait has been sequenced and analyzed, the candidate
functional regions (e.g. exons and splice sites, promoters and other regulatory regions) are
scanned for mutations by comparing the sequences of a selected number of controls and cases,
using adequate software.

In some embodiments, trait-positive samples being compared to identify causal
mutations are selected among those carrying the ancestral haplotype; in these embodiments,
control samples are chosen from individuals not carrying said ancestral haplotype.

In further embodiments, trait-positive samples being compared to identify causal
mutations are selected among those showing haplotypes that are as close as possible to the
ancestral haplotype; in these embodiments, control samples are chosen from individuals not
carrying any of the haplotypes selected for the case population.

The maps and biallelic markers of the present invention may also be used to identify
patterns of biallelic markers associated with detectable traits resulting from polygenic
interactions. The analysis of genetic interaction between alleles at unlinked loci requires
individual genotyping using the techniques described herein. The analysis of allelic interaction
among a selected set of biallelic markers with appropriate p-values can be considered as a
haplotype analysis, similar to those described in further detail within the present invention.

IX. Use of Biallelic Markers to Identify Individuals Likely to Exhibit a Detectable Trait
Associated with a Particular Allele of a Known Gene

In addition to their utility in searches for genes associated with detectable traits on a
genome-wide, chromosome-wide, or subchromosomal level, the maps and biallelic markers of
the present invention may be used in more targeted approaches for identifying individuals likely
to exhibit a particular detectable trait or individuals who exhibit a particular detectable trait as a
consequence of possessing a particular allele of a gene associated with the detectable trait. For
example, the biallelic markers and maps of the present invention may be used to identify
individuals who carry an allele of a known gene that is suspected of being associated with a
particular detectable trait. In particular, the target genes may be genes having alleles which
predispose an individual to suffer from a specific disease state. In other cases, the target genes
may be genes having alleles that predispose an individual to exhibit a desired or undesired
response to a drug or other pharmaceutical composition, a food, or any administered compound.
The known gene may encode any of a variety of types of biomolecules. For example, the known
genes targeted in such analyses may be genes known to be involved in a particular step in a
metabolic pathway in which disruptions may cause a detectable trait. Alternatively, the target
genes may be genes encoding receptors or ligands which bind to receptors in which disruptions
may cause a detectable trait, genes encoding transporters, genes encoding proteins with
signaling activities, genes encoding proteins involved in the immune response, genes encoding
proteins involved in hematopoiesis, or genes encoding proteins involved in wound healing. It will
be appreciated that the target genes are not limited to those specifically enumerated above, but
may be any gene known to be or suspected of being associated with a detectable trait.

As previously mentioned, the maps and markers of the present invention may be used
to identify genes associated with drug response. The biallelic markers of the present invention
may also be used to select individuals for inclusion in the clinical trials of a drug. In some
embodiments, the markers of SEQ ID Nos.: 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908
or the sequences complementary thereto may be used in targeted approaches to identify
individuals at risk of developing a detectable trait, for example a complex disease or
desired/undesired drug response, or to identify individuals exhibiting said trait. The present
invention provides methods to establish putative associations between any of the biallelic
markers described herein and any detectable traits, including those specifically described herein.

To use the maps and markers of the present invention in further targeted approaches,
biallelic markers which are in linkage disequilibrium with any of the above disclosed markers
may be identified. In cases where one or more biallelic markers of the present invention have
been shown to be associated with a detectable trait, more biallelic markers in linkage
disequilibrium with said associated biallelic markers may be generated and used to perform
targeted approaches aiming at identifying individuals exhibiting, or likely to exhibit, said
detectable trait, according to the methods provided herein.

Furthermore, in cases where a candidate gene is suspected of being associated with a
particular detectable trait or suspected of causing the detectable trait, biallelic markers in linkage
disequilibrium with said candidate gene may be identified and used in targeted approaches, such
as the approaches utilized above for the asthma-associated gene and the Apo E gene.

Biallelic markers that are in linkage disequilibrium with markers associated with a
detectable trait, or with genes associated with a detectable trait, or suspected of being so, are
identified by performing single marker analyses, haplotype association analyses, or linkage
disequilibrium measurements on samples from trait-positive and trait-negative individuals as
described above using biallelic markers lying in the vicinity of the target marker or gene. In this
manner, a single biallelic marker or a group of biallelic markers may be identified which indicate
that an individual is likely to possess the detectable trait or does possess the detectable trait as a
consequence of a particular allele of the target marker or gene.

Nucleic acid samples from individuals to be tested for predisposition to a detectable trait
or possession of a detectable trait as a consequence of a particular allele of the target gene may
be examined using the diagnostic methods described above.

Throughout this application, various publications, patents, and published patent
applications are cited. The disclosures of the publications, patents, and published patent
specifications referenced in this application are hereby incorporated by reference into the
present disclosure to more fully describe the state of the art to which this invention pertains.

EXAMPLES 

Several of the methods of the present invention are described in the following
examples, which are offered by way of illustration and not by way of limitation. Many other
modifications and variations of the invention as herein set forth can be made without
departing from the spirit and scope thereof and therefore only such limitations should be
imposed as are indicated by the appended claims.

Example 1

Ordering of a BAC Library: Screening Clones with STSs
The BAC library is screened with a set of PCR-typeable STSs to identify clones
containing the STSs. To facilitate PCR screening of several thousand clones, for example
200,000 clones, pools of clones are prepared.

Three-dimensional pools of the BAC libraries are prepared as described in Chumakov
et al. and are screened for the ability to generate an amplification fragment in amplification
reactions conducted using primers derived from the ordered STSs (Chumakov et al. (1995),
supra). A BAC library typically contains 200,000 BAC clones. Since the average size of each
insert is 100-300 kb, the overall size of such a library is equivalent to the size of at least about
7 human genomes. This library is stored as an array of individual clones in 518 384-well
plates. It can be divided into 74 primary pools (7 plates each). Each primary pool can then be
divided into 48 subpools prepared by using a three-dimensional pooling system based on the
plate, row and column address of each clone (more particularly, 7 subpools consisting of all
clones residing in a given microtiter plate; 16 subpools consisting of all clones in a given row;
24 subpools consisting of all clones in a given column).
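
The pooling arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 0-based addressing, the names CloneAddress, pools_for_clone and decode_address, and the 3 Gb genome size are hypothetical and do not come from the specification. Only the three enumerated dimensions (7 plate, 16 row and 24 column subpools, i.e. 47 in total) are modeled.

```python
from typing import NamedTuple

PLATES_TOTAL = 518              # 384-well plates in the library (from the text)
PLATES_PER_PRIMARY_POOL = 7     # plates combined into one primary pool
ROWS, COLS = 16, 24             # geometry of a 384-well plate
GENOME_SIZE_BP = 3_000_000_000  # assumption: approximate haploid genome size


class CloneAddress(NamedTuple):
    plate: int  # 0-based plate index within the whole library
    row: int    # 0..15
    col: int    # 0..23


def pools_for_clone(addr: CloneAddress) -> dict:
    """Return the primary pool and the plate/row/column subpools holding a clone."""
    primary, plate_in_pool = divmod(addr.plate, PLATES_PER_PRIMARY_POOL)
    return {"primary": primary, "plate": plate_in_pool,
            "row": addr.row, "col": addr.col}


def decode_address(primary: int, plate: int, row: int, col: int) -> CloneAddress:
    """Recover a clone address from one positive primary pool and one positive
    subpool per dimension. Unambiguous only when the primary pool holds a
    single positive clone, hence the confirmation PCR on the identified clone."""
    return CloneAddress(primary * PLATES_PER_PRIMARY_POOL + plate, row, col)


n_primary_pools = PLATES_TOTAL // PLATES_PER_PRIMARY_POOL   # primary pools
n_clones = PLATES_TOTAL * ROWS * COLS                       # wells in the library
subpools_enumerated = PLATES_PER_PRIMARY_POOL + ROWS + COLS
# Lower bound on coverage: 200,000 clones with 100 kb inserts
min_coverage = 200_000 * 100_000 / GENOME_SIZE_BP
```

With this layout, a clone is located with one PCR per primary pool plus one PCR per subpool of each positive primary pool, rather than one PCR per clone.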

Amplification reactions are conducted on the pooled BAC clones using primers
specific for the STSs. For example, the three dimensional pools may be screened with 45,000
STSs whose positions relative to one another and locations along the genome are known.
Preferably, the three dimensional pools are screened with about 30,000 STSs whose positions
relative to one another and locations along the genome are known. In a highly preferred
embodiment, the three dimensional pools are screened with about 20,000 STSs whose
positions relative to one another and locations along the genome are known.

Amplification products resulting from the amplification reactions are detected by
conventional agarose gel electrophoresis combined with automatic image capturing and
processing. PCR screening for an STS involves three steps: (1) identifying the positive primary
pools; (2) for each positive primary pool, identifying the positive plate, row and column
'subpools' to obtain the address of the positive clone; (3) directly confirming the PCR assay
on the identified clone. PCR assays are performed with primers specifically defining the STS.

Screening is conducted as follows. First, BAC DNA containing the genomic inserts is
prepared as follows. Bacteria containing the BACs are grown overnight at 37°C in 120 µl of
LB containing chloramphenicol (12 µg/ml). DNA is extracted by the following protocol:
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and resuspend pellet in 120 µl TE 10-2 (Tris HCl 10 mM, EDTA 2 mM)
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and incubate pellet with 20 µl lysozyme 1 mg/ml for 15 min at room temperature
Add 20 µl proteinase K 100 µg/ml and incubate 15 min at 60°C
Add 8 µl DNAse 2 U/µl and incubate 1 hr at room temperature
Add 100 µl TE 10-2 and keep at -80°C

PCR assays are performed using the following protocol: 

Final volume                                             15 µl
BAC DNA                                                  1.7 ng/µl
MgCl2                                                    2 mM
dNTP (each)                                              200 µM
primer (each)                                            2.9 ng/µl
Ampli Taq Gold DNA polymerase                            0.05 unit/µl
PCR buffer (10x = 0.1 M Tris-HCl pH 8.3, 0.5 M KCl)      1x


The amplification is performed on a Genius II thermocycler. After heating at 95°C for 10 min,
40 cycles are performed. Each cycle comprises: 30 sec at 95°C, 1 min at 54°C, and 30 sec at
72°C. A final elongation of 10 min at 72°C ends the amplification. PCR products are analyzed
on 1% agarose gel with 0.1 mg/ml ethidium bromide.

Alternatively, a YAC (Yeast Artificial Chromosome) library can be used. The very
large insert size, of the order of 1 megabase, is the main advantage of the YAC libraries. The
library can typically include about 33,000 YAC clones as described in Chumakov et al. (1995,
supra). The YAC screening protocol may be the same as the one used for BAC screening.
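
As a quick check of the protocol above, the per-reaction amounts and the total programmed hold time can be worked out as follows. This is a minimal sketch: the function names are hypothetical, the BAC DNA and primer concentrations are read here as 1.7 and 2.9 ng/µl (an assumption about partly illegible values in the reaction table), and ramp times between temperatures are ignored.

```python
FINAL_VOLUME_UL = 15.0           # final reaction volume (from the text)
INITIAL_DENATURATION_MIN = 10.0  # 95°C before cycling
CYCLES = 40
CYCLE_STEPS_SEC = (30, 60, 30)   # 95°C, 54°C, 72°C
FINAL_ELONGATION_MIN = 10.0      # 72°C after cycling

# Concentrations expressed per microlitre of reaction.
# The DNA and primer values are assumed readings of the reaction table.
PER_UL = {
    "polymerase_units": 0.05,
    "bac_dna_ng": 1.7,
    "primer_each_ng": 2.9,
}


def per_reaction_amounts(volume_ul: float = FINAL_VOLUME_UL) -> dict:
    """Total amount of each component in one reaction of the given volume."""
    return {name: conc * volume_ul for name, conc in PER_UL.items()}


def program_hold_minutes() -> float:
    """Sum of programmed hold times, excluding temperature ramps."""
    cycling_min = CYCLES * sum(CYCLE_STEPS_SEC) / 60.0
    return INITIAL_DENATURATION_MIN + cycling_min + FINAL_ELONGATION_MIN


amounts = per_reaction_amounts()
total_minutes = program_hold_minutes()
```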

The known order of the STSs is then used to align the BAC inserts in an ordered array
(contig) spanning the whole human genome. If necessary, new STSs to be tested can be
generated by sequencing the ends of selected BAC inserts. Subchromosomal localization of
the BACs can be established and/or verified by fluorescence in situ hybridization (FISH),
performed on metaphasic chromosomes as described by Cherif et al. 1990 and in Example 3
below. BAC insert size may be determined by Pulsed Field Gel Electrophoresis after
digestion with the restriction enzyme NotI.

Finally, a minimally overlapping set of BAC clones, with known insert size and
subchromosomal location, covering the entire genome, a set of chromosomes, a single
chromosome, a particular subchromosomal region, or any other desired portion of the genome
is selected from the DNA library. For example, the BAC clones may cover at least 100kb of
contiguous genomic DNA, at least 250kb of contiguous genomic DNA, at least 500kb of
contiguous genomic DNA, at least 2Mb of contiguous genomic DNA, at least 5Mb of
contiguous genomic DNA, at least 10Mb of contiguous genomic DNA, or at least 20Mb of
contiguous genomic DNA. 

Example 2 

Screening BAC libraries with biallelic markers 
Amplification primers enabling the specific amplification of DNA fragments carrying
the biallelic markers, including the map-related biallelic markers of the invention, may be used to
screen clones in any genomic DNA library, preferably the BAC libraries described above, for the
presence of the biallelic markers.

Pairs of primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to
7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 were designed which
allow the amplification of fragments carrying the biallelic markers of SEQ ID Nos: 1 to 3908, 1
to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto. The amplification
primers of SEQ ID Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to
11773, 7866 to 10125, 10126 to 11599, and 11600 to 11773 may be used to screen clones in a
genomic DNA library for the presence of the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to
2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto.

It will be appreciated that amplification primers for the biallelic markers of SEQ ID Nos:
1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 need not be identical to the primers of SEQ ID
Nos: 3935 to 7842, 3935 to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125,
10126 to 11599, and 11600 to 11773. Rather, they can be any other primers allowing the
specific amplification of any DNA fragment carrying the markers and may be designed using
techniques familiar to those skilled in the art. The amplification primers may be
oligonucleotides of 8, 10, 15, 20 or more bases in length which enable the amplification of any
fragment carrying the polymorphic site in the markers. The polymorphic base may be in the
center of the amplification product or, alternatively, it may be located off-center. For
example, in some embodiments, the amplification product produced using these primers may
be at least 100 bases in length (i.e. 50 nucleotides on each side of the polymorphic base in
amplification products in which the polymorphic base is centrally located). In other
embodiments, the amplification product produced using these primers may be at least 500
bases in length (i.e. 250 nucleotides on each side of the polymorphic base in amplification
products in which the polymorphic base is centrally located). In still further embodiments, the
amplification product produced using these primers may be at least 1000 bases in length (i.e.
500 nucleotides on each side of the polymorphic base in amplification products in which the
polymorphic base is centrally located). Amplification primers such as those described above
are included within the scope of the present invention.

The localization of biallelic markers on BAC clones is performed essentially as
described in Example 1.

The BAC clones to be screened are distributed in three dimensional pools as described
in Example 1.

Amplification reactions are conducted on the pooled BAC clones using primers
specific for the biallelic markers to identify BAC clones which contain the biallelic markers,
using procedures essentially similar to those described in Example 1.

Amplification products resulting from the amplification reactions are detected by
conventional agarose gel electrophoresis combined with automatic image capturing and
processing. PCR screening for a biallelic marker involves three steps: (1) identifying the
positive primary pools; (2) for each positive primary pool, identifying the positive plate, row
and column 'subpools' to obtain the address of the positive clone; (3) directly confirming the
PCR assay on the identified clone. PCR assays are performed with primers defining the
biallelic marker.

Screening is conducted as follows. First, BAC DNA is isolated as follows. Bacteria
containing the genomic inserts are grown overnight at 37°C in 120 µl of LB containing
chloramphenicol (12 µg/ml). DNA is extracted by the following protocol:
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and resuspend pellet in 120 µl TE 10-2 (Tris HCl 10 mM, EDTA 2 mM)
Centrifuge 10 min at 4°C and 2000 rpm
Eliminate supernatant and incubate pellet with 20 µl lysozyme 1 mg/ml for 15 min at room temperature
Add 20 µl proteinase K 100 µg/ml and incubate 15 min at 60°C
Add 8 µl DNAse 2 U/µl and incubate 1 hr at room temperature
Add 100 µl TE 10-2 and keep at -80°C

PCR assays are performed using the following protocol: 

Final volume                                             15 µl
BAC DNA                                                  1.7 ng/µl
MgCl2                                                    2 mM
dNTP (each)                                              200 µM
primer (each)                                            2.9 ng/µl
Ampli Taq Gold DNA polymerase                            0.05 unit/µl
PCR buffer (10x = 0.1 M Tris-HCl pH 8.3, 0.5 M KCl)      1x

The amplification is performed on a Genius II thermocycler. After heating at 95°C for
10 min, 40 cycles are performed. Each cycle comprises: 30 sec at 95°C, 1 min at 54°C, and
30 sec at 72°C. A final elongation of 10 min at 72°C ends the amplification. PCR products are
analyzed on 1% agarose gel with 0.1 mg/ml ethidium bromide.

Example 3

Assignment of Biallelic Markers to Subchromosomal Regions
Metaphase chromosomes are prepared from phytohemagglutinin (PHA)-stimulated
blood cells from donors. PHA-stimulated lymphocytes from healthy males are cultured for 72 h in
RPMI-1640 medium. For synchronization, methotrexate (10 mM) is added for 17 h, followed by
addition of 5-bromodeoxyuridine (5-BudR, 0.1 mM) for 6 h. Colcemid (1 mg/ml) is added for
the last 15 min before harvesting the cells. Cells are collected, washed in RPMI, incubated with
a hypotonic solution of KCl (75 mM) at 37°C for 15 min and fixed in three changes of
methanol:acetic acid (3:1). The cell suspension is dropped onto a glass slide and air-dried.

BAC clones carrying the biallelic markers used to construct the maps of the present
invention (including the biallelic markers of SEQ ID Nos: 1 to 3908, 1 to 2260, 2261 to 3374,
3735 to 3908 or the sequences complementary thereto) can be isolated as described above. These
BACs or portions thereof, including fragments carrying said biallelic markers, obtained for
example from amplification reactions using pairs of primers of SEQ ID Nos: 3935 to 7842, 3935
to 6194, 6195 to 7668, 7669 to 7842, 7866 to 11773, 7866 to 10125, 10126 to 11599, and 11600
to 11773, can be used as probes to be hybridized with metaphasic chromosomes. It will be
appreciated that the hybridization probes to be used in the contemplated method may be
generated using alternative methods well known to those skilled in the art. Hybridization probes
may have any length suitable for this intended purpose.

Probes are then labeled with biotin-16 dUTP by nick translation according to the
manufacturer's instructions (Bethesda Research Laboratories, Bethesda, MD), purified using a
Sephadex G-50 column (Pharmacia, Uppsala, Sweden) and precipitated. Just prior to
hybridization, the DNA pellet is dissolved in hybridization buffer (50% formamide, 2 X SSC,
10% dextran sulfate, 1 mg/ml sonicated salmon sperm DNA, pH 7) and the probe is denatured at
70°C for 5-10 min.

Slides kept at -20°C are treated for 1 h at 37°C with RNase A (100 mg/ml), rinsed three
times in 2 X SSC and dehydrated in an ethanol series. Chromosome preparations are denatured
in 70% formamide, 2 X SSC for 2 min at 70°C, then dehydrated at 4°C. The slides are treated
with proteinase K (10 mg/100 ml in 20 mM Tris-HCl, 2 mM CaCl2) at 37°C for 8 min and
dehydrated. The hybridization mixture containing the probe is placed on the slide, covered with
a coverslip, sealed with rubber cement and incubated overnight in a humid chamber at 37°C.
After hybridization and post-hybridization washes, the biotinylated probe is detected by avidin-
FITC and amplified with additional layers of biotinylated goat anti-avidin and avidin-FITC. For
chromosomal localization, fluorescent R-bands are obtained as previously described (Cherif et
al. (1990), supra). The slides are observed under a LEICA fluorescence microscope (DMRXA).
Chromosomes are counterstained with propidium iodide and the fluorescent signal of the probe
appears as two symmetrical yellow-green spots on both chromatids of the fluorescent R-band
chromosome (red). Thus, a particular biallelic marker may be localized to a particular
cytogenetic R-band on a given chromosome.

The above procedure was used to confirm the subchromosomal location of many of the
BAC clones harboring the markers obtained above. In particular, several of the markers were
assigned to subchromosomal regions of chromosome 21. Simple identification numbers were
attributed to each BAC from which the markers are derived. Figure 1 is a cytogenetic map of
chromosome 21 indicating the subchromosomal regions therein. Amplification primers for
generating amplification products containing the polymorphic bases of these markers are also
provided in the accompanying Sequence Listing. In addition, microsequencing primers for use in
determining the identities of the polymorphic bases of these biallelic markers are provided in the
accompanying Sequence Listing.

The rate at which biallelic markers may be assigned to subchromosomal regions may be
enhanced through automation. For example, probe preparation may be performed in a microtiter
plate format, using suitable robots. The rate at which biallelic markers may be assigned to
subchromosomal regions may also be enhanced using techniques which permit the in situ
hybridization of multiple probes on a single microscope slide, such as those disclosed in Larin et
al., Nucleic Acids Research 22: 3689-3692 (1994). In the largest test format described, different
probes were hybridized simultaneously by applying them directly from a 96-well microtiter dish
which was inverted on a glass plate. Software for image data acquisition and analysis that is
adapted to each optical system, test format, and fluorescent probe used, can be derived from the
system described in Lichter et al. Science 247: 64-69 (1990). Such software measures the
relative distance between the center of the fluorescent spot corresponding to the hybridized probe
and the telomeric end of the short arm of the corresponding chromosome, as compared to the
total length of the chromosome. The rate at which biallelic markers are assigned to
subchromosomal locations may be further enhanced by simultaneously applying probes labeled
with different fluorescent tags to each well of the 96 well dish. A further benefit of conducting
the analysis on one slide is that it facilitates automation, since a microscope having a moving
stage and the capability of detecting fluorescent signals in different metaphase chromosomes
could provide the coordinates of each probe on the metaphase chromosomes distributed on the
96 well dish.
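
The relative-distance measurement mentioned above can be illustrated with a short sketch. It assumes, purely for illustration, that the image analysis step already yields 2D pixel coordinates for the spot center and for both telomeric ends, and that the chromosome is approximately straight; the function name fractional_length is hypothetical and is not taken from the cited software.

```python
from typing import Tuple

Point = Tuple[float, float]


def fractional_length(spot: Point, p_telomere: Point, q_telomere: Point) -> float:
    """Position of a hybridization spot as a fraction of chromosome length,
    measured from the telomeric end of the short arm (0.0) to the telomeric
    end of the long arm (1.0). The spot is projected onto the straight axis
    joining the two telomeres, so small lateral offsets are ignored."""
    axis_x = q_telomere[0] - p_telomere[0]
    axis_y = q_telomere[1] - p_telomere[1]
    length_sq = axis_x ** 2 + axis_y ** 2
    if length_sq == 0:
        raise ValueError("telomere coordinates coincide")
    rel_x = spot[0] - p_telomere[0]
    rel_y = spot[1] - p_telomere[1]
    return (rel_x * axis_x + rel_y * axis_y) / length_sq


# Hypothetical coordinates: spot lies 30% of the way along the chromosome
example = fractional_length((3.0, 0.2), (0.0, 0.0), (10.0, 0.0))
```

Because the result is a ratio, it is independent of image magnification and of the degree of chromosome condensation, to the extent that condensation is uniform along the chromosome.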

Example 4 below describes an alternative method to position biallelic markers which
allows their assignment to human chromosomes.

Example 4

Assignment of Biallelic Markers to Human Chromosomes
The biallelic markers used to construct the maps of the present invention, including
the biallelic markers of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the
sequences complementary thereto, may be assigned to a human chromosome using monosomal
analysis as described below.

The chromosomal localization of a biallelic marker can be performed through the use
of somatic cell hybrid panels. For example, 24 panels, each panel containing a different
human chromosome, may be used (Russell et al., Somat. Cell Mol. Genet. 22:425-431 (1996);
Drwinga et al., Genomics 16:311-314 (1993)).

The biallelic markers are localized as follows. The DNA of each somatic cell hybrid
is extracted and purified. Genomic DNA samples from a somatic cell hybrid panel are
prepared as follows. Cells are lysed overnight at 42°C with 3.7 ml of lysis solution composed
of:
3 ml TE 10-2 (Tris HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
200 µl SDS 10%
500 µl K-proteinase (2 mg K-proteinase in TE 10-2 / NaCl 0.4 M)

For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) is added. After
vigorous agitation, the solution is centrifuged for 20 min at 10,000 rpm. For the precipitation
of DNA, 2 to 3 volumes of 100 % ethanol are added to the previous supernatant, and the
solution is centrifuged for 30 min at 2,000 rpm. The DNA solution is rinsed three times with
70 % ethanol to eliminate salts, and centrifuged for 20 min at 2,000 rpm. The pellet is dried at
37°C, and resuspended in 1 ml TE 10-1 or 1 ml water. The DNA concentration is evaluated by
measuring the OD at 260 nm (1 unit OD = 50 µg/ml DNA). To determine the presence of
proteins in the DNA solution, the OD 260/OD 280 ratio is determined. Only DNA preparations
having an OD 260/OD 280 ratio between 1.8 and 2 are used in the PCR assay.
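
The quality check described above amounts to two small calculations, sketched below. The function names and the dilution_factor parameter are illustrative assumptions; the conversion factor (1 OD unit at 260 nm = 50 µg/ml DNA) and the acceptance window (1.8 to 2) are those stated in the text.

```python
UG_PER_ML_PER_OD260 = 50.0  # 1 OD unit at 260 nm = 50 µg/ml DNA (from the text)


def dna_concentration_ug_per_ml(od260: float, dilution_factor: float = 1.0) -> float:
    """DNA concentration of the stock, given the OD at 260 nm of the measured
    sample. dilution_factor is an assumed convenience for diluted readings."""
    return od260 * UG_PER_ML_PER_OD260 * dilution_factor


def passes_purity_check(od260: float, od280: float,
                        low: float = 1.8, high: float = 2.0) -> bool:
    """True when the OD 260/OD 280 ratio lies within the accepted window."""
    if od280 <= 0:
        raise ValueError("OD at 280 nm must be positive")
    return low <= od260 / od280 <= high


# Hypothetical readings for one preparation
concentration = dna_concentration_ug_per_ml(0.5)   # µg/ml
usable = passes_purity_check(0.95, 0.5)            # ratio 1.9
```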

Then, a PCR assay is performed on genomic DNA with primers defining the biallelic
marker. The PCR assay is performed as described above for BAC screening. The PCR
products are analyzed on a 1 % agarose gel containing 0.2 mg/ml ethidium bromide.

Example 5 
Measurement of Linkage Disequilibrium
As originally reported by Strittmatter et al. and by Saunders et al. in 1993, the Apo E
ε4 allele is strongly associated with both late-onset familial and sporadic Alzheimer's disease
(Saunders, A.M. Lancet 342: 710-711 (1993) and Strittmatter, W.J. et al., Proc. Natl. Acad.
Sci. U.S.A. 90: 1977-1981 (1993)). The 3 major isoforms of human Apolipoprotein E (apoE2,
-E3, and -E4), as identified by isoelectric focusing, are coded for by 3 alleles (ε2, ε3, and ε4).
The ε2, ε3, and ε4 isoforms differ in amino acid sequence at 2 sites, residue 112 (called site
A) and residue 158 (called site B). The ancestral isoform of the protein is Apo E3, which at
sites A/B contains cysteine/arginine, while ApoE2 and -E4 contain cysteine/cysteine and
arginine/arginine, respectively (Weisgraber, K.H. et al., J. Biol. Chem. 256: 9077-9083
(1981); Rall, S.C. et al., Proc. Natl. Acad. Sci. U.S.A. 79: 4696-4700 (1982)).
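
The relationship between the residues at sites A/B and the three isoforms can be written as a simple lookup. This is only a sketch restating the paragraph above; the names ISOFORM_BY_SITES and apo_e_isoform are illustrative, and the Arg/Cys combination, which the text does not describe, is deliberately left unmapped.

```python
from typing import Optional

# (residue at site A, position 112; residue at site B, position 158) -> isoform
ISOFORM_BY_SITES = {
    ("Cys", "Arg"): "E3",  # ancestral isoform
    ("Cys", "Cys"): "E2",
    ("Arg", "Arg"): "E4",
}


def apo_e_isoform(site_a: str, site_b: str) -> Optional[str]:
    """Return the Apo E isoform for the given site A / site B residues,
    or None for a combination not described in the text."""
    return ISOFORM_BY_SITES.get((site_a, site_b))


# E3 and E4 differ only at site A, which is why the site A polymorphism
# is the one tracked in the association analysis.
differs_at_site_a_only = (
    apo_e_isoform("Cys", "Arg") == "E3" and apo_e_isoform("Arg", "Arg") == "E4"
)
```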

Apo E ε4 is currently considered a major susceptibility risk factor for Alzheimer's
disease development in individuals of different ethnic groups (especially in Caucasians and
Japanese compared to Hispanics or African Americans), across all ages between 40 and 90
years, and in both men and women, as reported recently in a study performed on 5930
Alzheimer's disease patients and 8607 controls (Farrer et al., JAMA 278:1349-1356 (1997)).
More specifically, the frequency of a C base coding for arginine 112 at site A is significantly
increased in Alzheimer's disease patients.

Although the mechanistic link between Apo E ε4 and neuronal degeneration
characteristic of Alzheimer's disease remains to be established, current hypotheses suggest
that the Apo E genotype may influence neuronal vulnerability by increasing the deposition
and/or aggregation of the amyloid beta peptide in the brain or by indirectly reducing energy
availability to neurons by promoting atherosclerosis.

Using the methods of the present invention, biallelic markers that are in the vicinity of
the Apo E site A were generated and the association of one of their alleles with Alzheimer's
disease was analyzed. An Apo E public marker (stSG94) was used to screen a human genome
BAC library as previously described. A BAC, which gave a unique FISH hybridization signal
on chromosomal region 19q13.2.3, the chromosomal region harboring the Apo E gene, was
selected for finding biallelic markers in linkage disequilibrium with the Apo E gene as
follows.

This BAC contained an insert of 205 kb that was subcloned as previously described. 
Fifty BAC subclones were randomly selected and sequenced. Twenty five subclone sequences 
were selected and used to design twenty five pairs of PCR primers allowing 500 bp-amplicons 
to be generated. These PCR primers were then used to amplify the corresponding genomic 
sequences in a pool of DNA from 100 unrelated individuals (blood donors of French origin) as
already described. 

Amplification products from pooled DNA were sequenced and analyzed for the
presence of biallelic polymorphisms, as already described. Five amplicons were shown to
contain a polymorphic base in the pool of 100 unrelated individuals, and therefore these
polymorphisms were selected as random biallelic markers in the vicinity of the Apo E gene.
The sequences of both alleles of these biallelic markers (99-344-439; 99-366-274; 99-359-308;
99-355-219; 99-365-344) correspond to SEQ ID Nos: 3909 to 3913. Corresponding pairs of
amplification primers for generating amplicons containing these biallelic markers can be
chosen from those listed as SEQ ID Nos: 7843 to 7847 and 11774 to 11778.

An additional pair of primers (SEQ ID Nos: 3124 and 4169) was designed that allows
amplification of the genomic fragment carrying the biallelic polymorphism corresponding to
the ApoE marker (99-2452-54; C/T; designated SEQ ID NO: 3914 in the accompanying
Sequence Listing; publicly known as Apo E site A (Weisgraber et al. (1981), supra; Rall et al.
(1982), supra)).

The five random biallelic markers plus the Apo E site A marker were physically
ordered by PCR screening of the corresponding amplicons using all available BACs originally
selected from the genomic DNA libraries, as previously described, using the public Apo E
marker stSG94. The order of the amplicons derived from this BAC screening is as follows: (99-
344-439/99-366-274) - (99-365-344/99-2452-54) - 99-359-308 - 99-355-219, where
parentheses indicate that the exact order of the respective amplicons could not be established.

Linkage disequilibrium among the six biallelic markers (five random markers plus the
Apo E site A) was determined by genotyping the same 100 unrelated individuals from whom
the random biallelic markers were identified.

DNA samples and amplification products from genomic PCR were obtained under
conditions similar to those described above for the generation of biallelic markers, and subjected to
automated microsequencing reactions using fluorescent ddNTPs (specific fluorescence for
each ddNTP) and the appropriate microsequencing primers having a 3' end immediately
upstream of the polymorphic base in the biallelic markers. Once specifically extended at the 3'
end by a DNA polymerase using the complementary fluorescent dideoxynucleotide analog
(thermal cycling), the microsequencing primer was precipitated to remove the unincorporated
fluorescent ddNTPs. The reaction products were analyzed by electrophoresis on ABI 377
sequencing machines. Results were automatically analyzed by appropriate software further
described in Example 8.

Linkage disequilibrium (LD) between all pairs of biallelic markers (Mi, Mj) was
calculated for every allele combination (Mi1,Mj1; Mi1,Mj2; Mi2,Mj1; Mi2,Mj2) according
to the maximum likelihood estimate (MLE) for delta (the composite linkage disequilibrium
coefficient). The results of the linkage disequilibrium analysis between the Apo E Site A
marker and the five new biallelic markers (99-344-439; 99-355-219; 99-359-308; 99-365-
344; 99-366-274) are summarized in Table 2 below:

Table 2

Markers                     d x 100    SEQ ID Nos of the     SEQ ID Nos of the
                                       biallelic Markers     amplification Primers

ApoE Site A (99-2452-54)               1028, 2076            3124, 4169
99-344-439                  1          1023, 2071            3119, 4164
99-366-274                  1          1024, 2072            3120, 4165
99-365-344                  8          1027, 2075            3123, 4168
99-359-308                  2          1025, 2073            3121, 4166
99-355-219                  1          1026, 2074            3122, 4167


The above linkage disequilibrium results indicate that among the five biallelic markers
randomly selected in a region of about 200 kb containing the Apo E gene, marker
99-365-344T is in relatively strong linkage disequilibrium with the Apo E site A allele
(99-2452-54C).
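
As a rough illustration of the quantity tabulated above, the composite linkage
disequilibrium coefficient can be estimated directly from unphased genotypes. The
Python sketch below assumes that each individual's genotype at a marker is coded as the
number of copies (0, 1 or 2) of a chosen allele, and relies on the standard identity that
the composite coefficient equals half the covariance of those allele counts. The function
name and the toy genotypes are hypothetical and are not taken from the present study.

```python
def composite_ld(dosage_i, dosage_j):
    """Estimate the composite LD coefficient (delta) for two biallelic markers.

    dosage_i, dosage_j: per-individual counts (0, 1 or 2) of the chosen
    allele at marker Mi and marker Mj, listed in the same individual order.
    """
    if len(dosage_i) != len(dosage_j) or not dosage_i:
        raise ValueError("need two non-empty genotype lists of equal length")
    n = len(dosage_i)
    mean_i = sum(dosage_i) / n
    mean_j = sum(dosage_j) / n
    # Covariance of the allele counts (divided by n, not n - 1).
    cov = sum(a * b for a, b in zip(dosage_i, dosage_j)) / n - mean_i * mean_j
    return cov / 2.0


# Hypothetical toy genotypes, for illustration only.
independent = composite_ld([0, 1, 2, 0, 1, 2, 0, 1, 2],
                           [0, 0, 0, 1, 1, 1, 2, 2, 2])
coupled = composite_ld([2, 0, 2, 0], [2, 0, 2, 0])
```

Calling the function once per allele combination of a marker pair gives values that
differ only in sign for biallelic markers, so a single call per pair is sufficient in
practice.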

Therefore, since the Apo E site A allele is associated with Alzheimer's disease, one
can predict that the T allele of marker 99-365-344 will probably be found associated with
Alzheimer's disease. In order to test this hypothesis, the biallelic markers of SEQ ID Nos:
3909 to 3913 were used in association studies as described below.

A total of 225 Alzheimer's disease patients were recruited according to clinical inclusion
criteria based on the MMSE test. The 248 control cases included in this study were both
ethnically- and age-matched to the affected cases. Both affected and control individuals
corresponded to unrelated cases. The identities of the polymorphic bases of each of the
biallelic markers were determined in each of these individuals using the methods described
above. Techniques for conducting association studies are further described below.

The results of this study are summarized in Table 3 below:

Table 3

                            ASSOCIATION DATA
MARKER                      Difference in allele frequency          Corresponding
                            between individuals with Alzheimer's    p-value
                            and control individuals
99-344-439                  3.3%                                    9.54 E-02
99-366-274                  1.6%                                    2.09 E-01
99-365-344                  17.7%                                   6.9 E-10
99-2452-54 (ApoE Site A)    23.8%                                   3.95 E-21
99-359-308                  0.4%                                    9.2 E-01
99-355-219                  2.5%                                    2.54 E-01


The frequency of the Apo E site A allele in both Alzheimer's disease cases and
controls was found to be in agreement with that previously reported (ca. 10% in controls
and ca. 34% in Alzheimer's disease cases, leading to a 24% difference in allele frequency),
thus validating the Apo E ε4 association in the populations used for this study.

Moreover, as predicted from the linkage disequilibrium analysis (Table 3), a
significant association of the T allele of marker 99-365-344 with Alzheimer's disease cases
(18% increase in the T allele frequency in Alzheimer's disease cases compared to controls, p
value for this difference = 6.9 E-10) was observed.
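
The comparison of allele frequencies between cases and controls can be sketched as a
test on a 2x2 table of chromosome counts. The statistical test actually used to obtain the
p-values of Table 3 is not stated here, so the Pearson chi-square test below (one degree of
freedom, no continuity correction) is an assumption, as are the function name and the
illustrative counts, which merely approximate the rounded frequencies quoted in the text.

```python
import math


def allele_association(case_allele, case_total, control_allele, control_total):
    """Compare the frequency of one allele between cases and controls.

    Counts are numbers of chromosomes (two per individual). Returns the
    frequency difference, the Pearson chi-square statistic (1 degree of
    freedom, no continuity correction) and its p-value.
    """
    a, b = case_allele, case_total - case_allele
    c, d = control_allele, control_total - control_allele
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("each row and column of the 2x2 table must be non-zero")
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    difference = a / case_total - c / control_total
    return difference, chi2, p_value


# Illustrative counts only: 225 cases and 248 controls give 450 and 496
# chromosomes; 153 and 50 approximate the quoted 34% and 10% frequencies.
diff, chi2, p = allele_association(153, 450, 50, 496)
print(f"difference = {diff:.3f}, chi2 = {chi2:.1f}, p = {p:.2e}")
```

Because the counts are reconstructed from rounded percentages, the p-value printed by
this sketch should not be expected to reproduce the values of Table 3 exactly.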

The above results indicate that any marker in linkage disequilibrium with one given
marker associated with a trait will be associated with the trait. It will be appreciated that,
though in this case the ApoE Site A marker is the trait-causing allele (TCA) itself, the same
conclusion could be drawn with any other non trait-causing allele marker associated with the
studied trait.

These results further indicate that conducting association studies with a set of biallelic 
markers randomly generated within a candidate region at a sufficient density (here about one 
biallelic marker every 40kb on average), allows the identification of at least one marker 
associated with the trait. 

In addition, these results correlate with the physical order of the six biallelic markers
contemplated within the present example (see above): marker 99-365-344, which had been
found to be the closest in terms of physical distance to the ApoE Site A marker, also shows
the strongest linkage disequilibrium with the Apo E site A marker.

In order to further refine the relationship between physical distance and linkage
disequilibrium between biallelic markers, a ca. 450 kb fragment from a genomic region on
chromosome 8 was fully sequenced.
LD within ca. 230 pairs of biallelic markers derived therefrom was measured in a
random French population and analyzed as a function of the known physical inter-marker
spacing. This analysis confirmed that, on average, linkage disequilibrium between 2 biallelic
markers correlates with the physical distance that separates them. It further indicated that
linkage disequilibrium between 2 biallelic markers tends to decrease when their spacing
increases. More particularly, linkage disequilibrium between 2 biallelic markers tends to
decrease when their inter-marker distance is greater than 50kb, and is further decreased when
the inter-marker distance is greater than 75kb. It was further observed that when 2 biallelic
markers were further than 150kb apart, most often no significant linkage disequilibrium
between them could be evidenced. It will be appreciated that the size and history of the
sample population used to measure linkage disequilibrium between markers may influence the
distance beyond which linkage disequilibrium tends not to be detectable.

Assuming that linkage disequilibrium can be measured between markers spanning regions up
to an average of 150kb long, biallelic marker maps will allow genome-wide linkage
disequilibrium mapping, provided they have an average inter-marker distance lower than
150kb.

Example 6 

Identification of a Candidate Region Harboring a 
Gene Associated with a Detectable Trait 

The initial identification of a candidate genomic region harboring a gene associated
with a detectable trait may be conducted using a genome-wide map comprising about 20,000
biallelic markers. The candidate genomic region may be further defined using a map having a
higher marker density, such as a map comprising about 40,000 markers, about 60,000 markers,
about 80,000 markers, about 100,000 markers, or about 120,000 markers.
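
The relationship between the number of markers in a genome-wide map and the average
inter-marker distance is simple arithmetic, sketched below. The genome size of about
3,000 Mb is an assumption introduced for illustration (it is not stated in this passage),
and the markers are assumed to be evenly spread.

```python
GENOME_SIZE_KB = 3_000_000  # assumed haploid genome size (about 3,000 Mb)


def average_spacing_kb(n_markers, genome_size_kb=GENOME_SIZE_KB):
    """Average inter-marker distance, in kb, for evenly spread markers."""
    return genome_size_kb / n_markers


for n in (20_000, 40_000, 60_000, 80_000, 100_000, 120_000):
    print(f"{n:>7} markers: about {average_spacing_kb(n):.1f} kb apart")
```

Under this assumption a map of about 20,000 markers corresponds to an average spacing
of about 150 kb, and each denser map shortens that distance proportionally.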

The use of high density maps such as those described above allows the identification
of genes which are truly associated with detectable traits, since the coincidental associations
will be randomly distributed along the genome while the true associations will map within one
or more discrete genomic regions. Accordingly, biallelic markers located in the vicinity of a
gene associated with a detectable trait will give rise to broad peaks in graphs plotting the
frequencies of the biallelic markers in trait-positive individuals versus control individuals. In
contrast, biallelic markers which are not in the vicinity of the gene associated with the
detectable trait will produce unique points in such a plot. By determining the association of
several markers within the region containing the gene associated with the detectable trait, the
gene associated with the detectable trait can be identified using an association curve which
reflects the difference between the allele frequencies within the trait-positive and control
populations for each studied marker. The gene associated with the detectable trait will be
found in the vicinity of the marker showing the highest association with the trait.

Figures 4, 5, and 6 provide a simulated illustration of the above principles. As
illustrated in Figure 4, an association analysis conducted with a map comprising about 3,000
biallelic markers yields a group of points. However, when an association analysis is performed
using a denser map which includes additional biallelic markers, the points become broad
peaks indicative of the location of a gene associated with a detectable trait. For example, the
biallelic markers used in the initial association analysis may be obtained from a map
comprising about 20,000 biallelic markers, as illustrated by the simulation results shown in
Figure 5. In some embodiments, one or more of the biallelic markers of SEQ ID Nos. 1 to
3908, 1 to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto are used
in the association analysis.

In the simulated results of Figure 4, the association analysis with 3,000 markers
suggests peaks near markers 9 and 17.

Next, a second analysis is performed using additional markers in the vicinity of
markers 9 and 17, as illustrated in the simulated results of Figure 5, using a map of about
20,000 markers. This step again indicates an association in the close vicinity of marker 17,
since more markers in this region show an association with the trait. However, none of the
additional markers around marker 9 shows a significant association with the trait, which
makes marker 9 a potential false positive. In some embodiments, one or more of the biallelic
markers selected from the group consisting of SEQ ID Nos. 1 to 3908, 1 to 2260, 2261 to
3374, 3735 to 3908 or the sequences complementary thereto are used in the second analysis.
In order to further test the validity of these two suspected associations, a third analysis may be
performed with a map comprising about 60,000 biallelic markers. In some embodiments, one or
more of the biallelic markers selected from the group consisting of SEQ ID Nos: 1 to 3908, 1
to 2260, 2261 to 3374, 3735 to 3908 or the sequences complementary thereto are used in the
third association analysis. In the simulated results of Figure 6, more markers lying around
marker 17 exhibit a high degree of association with the detectable trait. Conversely, no
association is confirmed in the vicinity of marker 9. The genomic region surrounding marker
17 can thus be considered a candidate region for the potential trait of this simulation.

Example 7 

Haplotype Analysis: Identification of biallelic markers delineating 
a genomic region associated with Alzheimer's Disease (AD) 
As shown in Table 3 within Example 5, at an average map density of one marker per
40 kb only one marker (99-365-344) out of five random biallelic markers from a ca. 200 kb
genomic region around the Apo E gene showed a clear association to Alzheimer's disease
(delta allelic frequency in cases and controls = 18%; p value = 6.9 E-10). The allelic
frequencies of the other four random markers were not significantly different between
Alzheimer's disease cases and controls (p-values > E-01). However, since linkage
disequilibrium can usually be detected between markers located further apart than an average
40 kb as previously discussed, one should expect that performing an association study with a
local excerpt of a biallelic marker map covering ca. 200kb with an average inter-marker
distance of ca. 40kb would allow the identification of more than one biallelic marker
associated with Alzheimer's disease.

A haplotype analysis was thus performed using the biallelic markers 99-344-439;
99-355-219; 99-359-308; 99-365-344; and 99-366-274 (of SEQ ID Nos: 3909 to 3913).

In a first step, marker 99-365-344 that was already found associated with Alzheimer's
disease was not included in the haplotype study. Only biallelic markers 99-344-439,
99-355-219, 99-359-308, and 99-366-274, which did not show any significant association with
Alzheimer's disease when taken individually, were used. This first haplotype analysis
measured frequencies of all possible two-, three-, or four-marker haplotypes in the
Alzheimer's disease case and control populations. As shown in Figure 7, there was one
haplotype among all the potential different haplotypes based on the four individually
non-significant markers ("haplotype 8", TAGG, comprising SEQ ID No. 3910 with the T allele of
marker 99-366-274, SEQ ID No. 3909 with the A allele of marker 99-344-439, SEQ ID No.
3911 with the G allele of marker 99-359-308 and SEQ ID No. 3912 with the G allele of
marker 99-355-219) that was present at significantly different frequencies in the
Alzheimer's disease case and control populations (D=12%; p value = 2.05 E-06). Moreover, a
significant difference was already observed for a three-marker haplotype included in the above
mentioned "haplotype 8" ("haplotype 7", TGG, D=10%; p value = 4.76 E-05). Haplotype 7
comprises SEQ ID No. 3910 with the T allele of marker 99-366-274, SEQ ID No. 3911 with
the G allele of marker 99-359-308 and SEQ ID No. 3912 with the G allele of marker
99-355-219. The haplotype association analysis thus clearly increased the statistical power of
the individual marker association studies by more than four orders of magnitude when compared
to single-marker analysis, from p values > E-01 for the individual markers to a p value of
ca. 2 E-06 for the four-marker "haplotype 8". See Table 3.
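
To give a sense of how many candidate haplotypes such an analysis has to consider, the
following Python sketch enumerates every two-, three- and four-marker subset of the four
markers together with every allele combination. The allele labels "1" and "2" are
placeholders (the actual bases of each marker are not all given in this passage), and
the function name is hypothetical.

```python
from itertools import combinations, product

MARKERS = ["99-344-439", "99-355-219", "99-359-308", "99-366-274"]
ALLELES = ("1", "2")  # placeholder labels for the two alleles of each marker


def enumerate_haplotypes(markers, min_size=2):
    """Yield (marker subset, allele combination) for every candidate haplotype."""
    for size in range(min_size, len(markers) + 1):
        for subset in combinations(markers, size):
            for alleles in product(ALLELES, repeat=size):
                yield subset, alleles


candidates = list(enumerate_haplotypes(MARKERS))
subsets = {subset for subset, _ in candidates}
print(len(subsets), "marker subsets,", len(candidates), "candidate haplotypes")
```

For four biallelic markers this gives 11 marker subsets and 72 candidate haplotypes, each
of which is compared between cases and controls, so the significance of the best
haplotype needs to be checked against what chance alone would produce.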

The significance of the values obtained for this haplotype association analysis was
evaluated by the following computer simulation. The genotype data from the Alzheimer's
disease cases and the unaffected controls were pooled and randomly allocated to two groups
which contained the same number of individuals as the case/control groups used to produce
the data summarized in Figure 7. A four-marker haplotype analysis (99-344-439; 99-355-219;
99-359-308; and 99-366-274) was run on these artificial groups. This experiment was
reiterated 100 times and the results are shown in Figure 8. No haplotype among those
generated was found for which the p-value of the frequency difference between both
populations was more significant than 1 E-05. In addition, only 4% of the generated
haplotypes showed p-values lower than 1 E-04. Since both these p-value thresholds are less
significant than the 2 E-06 p-value shown by "haplotype 8", this haplotype can be considered
significantly associated with Alzheimer's disease.
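
The pooling-and-reallocation procedure described above can be sketched as a generic
permutation loop. The Python example below is a simplification under stated assumptions:
it uses a single-marker allele frequency difference as the statistic rather than an
estimated haplotype frequency, genotypes are coded as allele counts (0, 1 or 2), and all
function names and data are hypothetical.

```python
import random


def allele_frequency_difference(group_a, group_b):
    """Absolute difference in allele frequency between two groups of individuals.

    Each individual is coded as the number of copies (0, 1 or 2) of one allele.
    """
    freq_a = sum(group_a) / (2 * len(group_a))
    freq_b = sum(group_b) / (2 * len(group_b))
    return abs(freq_a - freq_b)


def permutation_p_value(cases, controls, statistic, n_iter=100, seed=0):
    """Pool both groups, reallocate them at random and compare statistics.

    Returns the observed statistic and the fraction of random reallocations
    whose statistic is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = statistic(cases, controls)
    pooled = list(cases) + list(controls)
    n_cases = len(cases)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        # The artificial groups keep the sizes of the real case/control groups.
        if statistic(pooled[:n_cases], pooled[n_cases:]) >= observed:
            hits += 1
    return observed, hits / n_iter


# Hypothetical, deliberately extreme toy data.
observed, fraction = permutation_p_value([2] * 50, [0] * 50,
                                         allele_frequency_difference)
```

With only 100 iterations, as in the simulation described above, the smallest non-zero
fraction that can be resolved is 1%, so the result is best read as a sanity check rather
than a precise p-value.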

In a second step, marker 99-365-344 was included in the haplotype analyses. The
frequency differences between the affected and non-affected populations were calculated for
all two-, three-, four- or five-marker haplotypes involving markers: 99-344-439; 99-355-219;
99-359-308; 99-366-274; and 99-365-344. The most significant p-values obtained in each
category of haplotype (involving two, three, four or five markers) were examined depending
on which markers were involved or not within the haplotype. This showed that all haplotypes
which included marker 99-365-344 showed a significant association with Alzheimer's disease
(p-values in the range of E-04 to E-11).
An additional way of evaluating the significance of the values obtained in the
haplotype association analysis was to perform a similar Alzheimer's disease case-control
study on biallelic markers generated from BACs containing inserts corresponding to genomic
regions derived from chromosomes 13 or 21 and not known to be involved in Alzheimer's
disease. Performing haplotype and individual association analyses similar to those described
above and in Example 10 did not generate any significant association results (all p-values for
haplotype analyses were less significant than E-03; all p-values for single marker association
studies were less significant than E-02).

Example 8 

Genotyping of biallelic markers using microsequencing procedures
Several microsequencing protocols conducted in liquid phase are well known to those
skilled in the art. A first possible detection analysis allowing the allele characterization of the
microsequencing reaction products relies on detecting fluorescent ddNTP-extended
microsequencing primers after gel electrophoresis. A first alternative to this approach consists
in performing a liquid phase microsequencing reaction, the analysis of which may be carried
out in solid phase.

For example, the microsequencing reaction may be performed using 5'-biotinylated
oligonucleotide primers and fluorescein-dideoxynucleotides. The biotinylated oligonucleotide
is annealed to the target nucleic acid sequence immediately adjacent to the polymorphic
nucleotide position of interest. It is then specifically extended at its 3'-end following a PCR
cycle, wherein the labeled dideoxynucleotide analog complementary to the polymorphic base
is incorporated. The biotinylated primer is then captured on a microtiter plate coated with
streptavidin. The analysis is thus entirely carried out in a microtiter plate format. The
incorporated ddNTP is detected by a fluorescein antibody - alkaline phosphatase conjugate.

In practice this microsequencing analysis is performed as follows. 20 µl of the
microsequencing reaction is added to 80 µl of capture buffer (SSC 2X, 2.5% PEG 8000, 0.25
M Tris pH 7.5, 1.8% BSA, 0.05% Tween 20) and incubated for 20 minutes on a microtiter
plate coated with streptavidin (Boehringer). The plate is rinsed once with washing buffer (0.1
M Tris pH 7.5, 0.1 M NaCl, 0.1% Tween 20). 100 µl of anti-fluorescein antibody conjugated
with alkaline phosphatase, diluted 1/5000 in washing buffer containing 1.8% BSA, is added to
the microtiter plate. The antibody is incubated on the microtiter plate for 20 minutes. After
washing the microtiter plate four times, 100 µl of 4-methylumbelliferyl phosphate (Sigma)
diluted to 0.4 mg/ml in 0.1 M diethanolamine pH 9.6, 10 mM MgCl2 are added. The detection
of the microsequencing reaction is carried out on a fluorimeter (Dynatech) after 20 minutes of
incubation.

As another alternative, solid phase microsequencing reactions have been developed,
for which either the oligonucleotide microsequencing primers or the PCR-amplified products
derived from the DNA fragment of interest are immobilized. For example, immobilization can
be carried out via an interaction between biotinylated DNA and streptavidin-coated
microtitration wells or avidin-coated polystyrene particles.

As a further alternative, the PCR reaction generating the amplicons to be genotyped
can be performed directly in solid phase conditions, following procedures such as those
described in WO 96/13609.

In such solid phase microsequencing reactions, incorporated ddNTPs can either be
radiolabeled (see Syvanen, Clin. Chim. Acta. 226:225-236 (1994)) or linked to fluorescein
(see Livak and Hainer, Hum. Mutat. 3:379-385 (1994)). The detection of radiolabeled ddNTPs
can be achieved through scintillation-based techniques. The detection of fluorescein-linked
ddNTPs can be based on the binding of antifluorescein antibody conjugated with alkaline
phosphatase, followed by incubation with a chromogenic substrate (such as p-nitrophenyl
phosphate).

Other possible reporter-detection couples for use in the above microsequencing
procedures include:

-ddNTP linked to dinitrophenyl (DNP) and anti-DNP alkaline phosphatase conjugate
(see Harju et al., Clin. Chem. 39(11 Pt 1):2282-2287 (1993))

-biotinylated ddNTP and horseradish peroxidase-conjugated streptavidin with
o-phenylenediamine as a substrate (see WO 92/15712).

A diagnostic kit based on fluorescein-linked ddNTP with antifluorescein antibody
conjugated with alkaline phosphatase has been commercialized under the name PRONTO by
GamidaGen Ltd.

As yet another alternative microsequencing procedure, Nyren et al. (Anal. Biochem.
208:171-175 (1993)) have described a solid-phase DNA sequencing procedure that relies on
the detection of DNA polymerase activity by an enzymatic luminometric inorganic
pyrophosphate detection assay (ELIDA). In this procedure, the PCR-amplified products are
biotinylated and immobilized on beads. The microsequencing primer is annealed and four
aliquots of this mixture are separately incubated with DNA polymerase and one of the four
different ddNTPs. After the reaction, the resulting fragments are washed and used as
substrates in a primer extension reaction with all four dNTPs present. The progress of the
DNA-directed polymerization reactions is monitored with the ELIDA. Incorporation of a
ddNTP in the first reaction prevents the formation of pyrophosphate during the subsequent
dNTP reaction. In contrast, no ddNTP incorporation in the first reaction gives extensive
pyrophosphate release during the dNTP reaction and this leads to generation of light
throughout the ELIDA reactions. From the ELIDA results, the identity of the first base after
the primer is easily deduced.

It will be appreciated that several parameters of the above-described microsequencing
procedures may be successfully modified by those skilled in the art without undue
experimentation. In particular, high throughput improvements to these procedures may be
elaborated, following principles such as those described further below.

Example 9 
Sequence Analysis 

DNA sequences, such as BAC inserts, containing the region carrying the candidate
gene associated with the detectable trait are sequenced and their sequence is analyzed using
automated software which eliminates repeat sequences while retaining potential gene
sequences. The potential gene sequences are compared to numerous databases to identify
potential exons using a set of scoring algorithms such as trained Hidden Markov Models,
statistical analysis models (including promoter prediction tools) and the GRAIL neural
network. Preferred databases for use in this analysis, the construction and use of which are
further detailed in Example 17, include the following:

NRPU (Non-Redundant Protein-Unique) database: NRPU is a non-redundant merge
of the publicly available NBRF/PIR, Genpept, and SwissProt databases. Homologies found
with NRPU allow the identification of regions potentially coding for already known proteins
or related to known proteins (translated exons).

NREST (Non-Redundant EST database): NREST is a merge of the EST subsection of
the publicly available GenBank database. Homologies found with NREST allow the location
of potentially transcribed regions (translated or non-translated exons).

NRN (Non-Redundant Nucleic acid database): NRN is a merge of GenBank, EMBL
and their daily updates.

Any sequence giving a positive hit with NRPU, NREST or an "excellent" score using
GRAIL and/or other scoring algorithms is considered a potential functional region, and is then
considered a candidate for genomic analysis.

While this first screening allows the detection of the "strongest" exons, a semi-automatic
scan is further applied to the remaining sequences in the context of the sequence
assembly. That is, the sequences neighboring a 5' site or an exon are submitted to another
round of bioinformatics analysis with modified parameters. In this way, new exon candidates
are generated for genomic analysis.

Using the above procedures, genes associated with detectable traits may be
identified.

Example 10 

YAC Contig Construction in the Candidate Genomic Region
Substantial amounts of LOH data supported the hypothesis that genes associated with
distinct cancer types are located within a particular region of the human genome. More
specifically, this region was likely to harbor a gene associated with prostate cancer.

Association studies were performed as described below in order to identify this
prostate cancer gene. First, a YAC contig which contains the candidate genomic region was
constructed as follows. The CEPH-Genethon YAC map for the entire human genome
(Chumakov et al. (1995), supra) was used for detailed contig building in the genomic region
containing genetic markers known to map in the candidate genomic region. Screening data
available for several publicly available genetic markers were used to select a set of CEPH
YACs localized within the candidate region. This set of YACs was tested by PCR with the
above mentioned genetic markers as well as with other publicly available markers supposedly
located within the candidate region. As a result of these studies, a YAC STS contig map was
generated around genetic markers known to map in this genomic region. Two CEPH YACs
were found to constitute a minimal tiling path in this region, with an estimated size of ca. 2
Megabases.

During this mapping effort, several publicly known STS markers were precisely 
located within the contig. 

Example 11 below describes the identification of sets of biallelic markers within the
candidate genomic region.

Example 11
BAC contig construction and 
Biallelic Markers isolation within the candidate chromosomal region. 
Next, a BAC contig covering the candidate genomic region was constructed as
follows. BAC libraries were obtained as described in Woo et al., Nucleic Acids Res.
22:4922-4931 (1994). Briefly, the two whole human genome BamHI and HindIII libraries already
described in related WIPO application No. PCT/IB98/00193 were constructed using the
pBeloBAC11 vector (Kim et al. (1996), supra).

The BAC libraries were then screened with all of the above mentioned STSs,
following the procedure described in Example 1 above.

The ordered BACs selected by STS screening and verified by FISH were assembled
into contigs and new markers were generated by partial sequencing of insert ends from some
of them. These markers were used to fill the gaps in the contig of BAC clones covering the
candidate chromosomal region having an estimated size of 2 megabases.

Figure 9 illustrates a minimal array of overlapping clones which was chosen for
further studies, and the positions of the publicly known STS markers along said contig.

Selected BAC clones from the contig were subcloned and sequenced, essentially
following the procedures described in related WIPO application No. PCT/IB98/00193.

Biallelic markers lying along the contig were identified following the processes
described in related WIPO application No. PCT/IB98/00193.

Figure 9 shows the locations of the biallelic markers along the BAC contig. This first 
set of markers corresponds to a medium density map of the candidate locus, with an inter- 
marker distance averaging 50kb-150kb. 

A second set of biallelic markers was then generated as described above in order to
provide a very high-density map of the region identified using the first set of markers which
can be used to conduct association studies, as explained below. This very high density map
has markers spaced on average every 2-50kb.

The biallelic markers were then used in association studies. DNA samples were
obtained from individuals suffering from prostate cancer and unaffected individuals as
described in Example 12.

Example 12 

Collection of DNA Samples from Affected and Non-affected Individuals
Prostate cancer patients were recruited according to clinical inclusion criteria based
on pathological or radical prostatectomy records. Control cases included in this study were
both ethnically- and age-matched to the affected cases; they were checked for both the
absence of all clinical and biological criteria defining the presence or the risk of prostate
cancer, and for the absence of related familial prostate cancer cases. Both affected and control
individuals were all unrelated.

The following two groups of independent individuals were used in the association
studies. The first group, comprising individuals suffering from prostate cancer, contained 185
individuals. Of these 185 cases of prostate cancer, 47 cases were sporadic and 138 cases were
familial. The control group contained 104 non-diseased individuals.

Haplotype analysis was conducted using additional diseased (total samples: 281) and 
control samples (total samples: 130), from individuals recruited according to similar criteria. 

DNA was extracted from peripheral venous blood of all individuals as described in
related WIPO application No. PCT/IB98/00193.

The frequencies of the biallelic markers in each population were determined as 
described in Example 13. 

Example 13 
Genotyping Affected and Control Individuals
Genotyping was performed using the following microsequencing procedure.
Amplification was performed on each DNA sample using primers designed as previously
explained. The pairs of primers of SEQ ID Nos.: 7849 to 7860 and 11780 to 11791 were used
to generate amplicons harboring the biallelic markers of SEQ ID Nos: 3915 to 3926 or the
sequences complementary thereto (markers 99-123-381, 4-26-29, 4-14-240, 4-77-151,
99-217-277, 4-67-40, 99-213-164, 99-221-377, 99-135-196, 99-1482-32, 4-73-134, and 4-65-324)
using the protocols described in related WIPO application No. PCT/IB98/00193.

Microsequencing primers were designed for each of the biallelic markers, as
previously described. After purification of the amplification products, the microsequencing
reaction mixture was prepared by adding, in a 20 µl final volume: 10 pmol microsequencing
oligonucleotide, 1 U Thermosequenase (Amersham E79000G), 1.25 µl Thermosequenase
buffer (260 mM Tris HCl pH 9.5, 65 mM MgCl2), and the two appropriate fluorescent
ddNTPs (Perkin Elmer, Dye Terminator Set 401095) complementary to the nucleotides at the
polymorphic site of each biallelic marker tested, following the manufacturer's
recommendations. After 4 minutes at 94°C, 20 PCR cycles of 15 sec at 55°C, 5 sec at 72°C,
and 10 sec at 94°C were carried out in a Tetrad PTC-225 thermocycler (MJ Research). The
unincorporated dye terminators were then removed by ethanol precipitation. Samples were
finally resuspended in formamide-EDTA loading buffer and heated for 2 min at 95°C before
being loaded on a polyacrylamide sequencing gel. The data were collected by an ABI PRISM
377 DNA sequencer and processed using the GENESCAN software (Perkin Elmer).

Following gel analysis, data were automatically processed with software that allows
the determination of the alleles of biallelic markers present in each amplified fragment.

The software evaluates such factors as whether the intensities of the signals resulting
from the above microsequencing procedures are weak, normal, or saturated, or whether the
signals are ambiguous. In addition, the software identifies significant peaks (according to
shape and height criteria). Among the significant peaks, peaks corresponding to the targeted
site are identified based on their position. When two significant peaks are detected for the
same position, each sample is categorized as homozygous or heterozygous based on the height
ratio.
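
The height-ratio rule described above can be sketched as follows. This is only an
illustration of the principle: the function name and both thresholds (min_height and
het_ratio) are hypothetical values chosen for the example and are not parameters of the
software referred to in this Example.

```python
def call_genotype(height_1, height_2, min_height=50.0, het_ratio=0.3):
    """Call a biallelic genotype from the two allele-specific peak heights.

    min_height and het_ratio are hypothetical thresholds chosen only for
    illustration; they are not values from the procedure described above.
    """
    strongest = max(height_1, height_2)
    if strongest < min_height:
        return "no call"  # signal too weak to interpret
    # Ratio of the weaker peak to the stronger one, between 0 and 1.
    ratio = min(height_1, height_2) / strongest
    if ratio >= het_ratio:
        return "heterozygous"
    return "homozygous allele 1" if height_1 > height_2 else "homozygous allele 2"


for heights in ((1000.0, 900.0), (1000.0, 20.0), (10.0, 1200.0), (10.0, 5.0)):
    print(heights, "->", call_genotype(*heights))
```

A production genotype caller would also need to handle saturated or ambiguous signals,
which the passage above mentions but this sketch does not model.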

Association analyses were then performed using the biallelic markers as described
below.

Example 14

Association Analysis
Association studies were run in two successive steps. In a first step, a rough
localization of the candidate gene was achieved by determining the frequencies of the biallelic
markers of Figure 9 in the affected and unaffected populations. The results of this rough
localization are shown in Figure 10. This analysis indicated that a gene responsible for
prostate cancer was located near the biallelic marker designated 4-67.

In a second phase of the analysis, the position of the gene responsible for prostate cancer was
further refined using the very high density set of markers including the markers of SEQ ID
Nos: 3915 to 3926 or the sequences complementary thereto (markers 99-123-381, 4-26-29,
4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, 99-135-196, 99-1482-32,
4-73-134, and 4-65-324).

As shown in Figure 11, the second phase of the analysis confirmed that the gene
responsible for prostate cancer was near the biallelic marker designated 4-67-40, most
probably within a ca. 150kb region comprising the marker.

A haplotype analysis was also performed as described in Example 15.

Example 15 
Haplotype analysis 

The allelic frequencies of each of the alleles of biallelic markers 99-123-381, 4-26-29,
4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, and 99-135-196 were
determined in the affected and unaffected populations. Table 4 lists the internal identification
numbers of the markers used in the haplotype analysis (SEQ ID Nos: 3915-3923), the alleles
of each marker, the most frequent allele in both unaffected individuals and individuals
suffering from prostate cancer, the least frequent allele in both unaffected individuals and
individuals suffering from prostate cancer, and the frequencies of the least frequent alleles in
each population.

Table 4

                                     Frequency of least frequent allele**
Markers        Polymorphic base*     Cases       Controls
99-123-381     C/T                   0.35        0.3
4-26-29        [illegible]           0.39        0.45
4-14-240       [illegible]           0.35        0.41
4-77-151       C/G                   0.33        0.24
99-217-277     C/T                   0.31        0.23
4-67-40        C/T                   0.26        0.16
99-213-164     T/C                   0.45        0.38
99-221-377     C/A                   0.43        0.43
99-135-196     A/G                   0.25        0.3

* most frequent allele/least frequent allele
** standard deviations: 0.023 to 0.031 for controls; 0.018 to 0.021 for cases



Among all the theoretical potential different haplotypes based on 2 to 9 markers, 11
haplotypes showing a strong association with prostate cancer were selected. The results of
these haplotype analyses are shown in Figure 12.

Figures 11 and 12 aggregate association analysis results with sequencing results
generated following the procedures further described in Example 16, which permitted the
physical order and the distance between markers to be estimated.

The significance of the values obtained in Figure 12 is underscored by the following
results of computer simulations. For the computer simulations, the data from the affected
individuals and the unaffected controls were pooled and randomly allocated to two groups
which contained the same number of individuals as the affected and unaffected groups used to
compile the data summarized in Figure 12. A haplotype analysis was run on these artificial
groups for the six markers included in haplotype 5 of Figure 12. This experiment was
reiterated 100 times and the results are shown in Figure 13. Among 100 iterations, only 5% of
the obtained haplotypes are present with a p-value less significant than E-04 as compared to
the p-value of 9E-07 for haplotype 5 of Figure 12. Furthermore, for haplotype 5 of Figure 12,
only 6% of the obtained haplotypes have a significance level below 5E-03, while none of them
show a significance level below 5E-03.
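
The simulation described above (pooling affected and unaffected individuals, reallocating them at random to two groups of the original sizes, and repeating the test) can be sketched as a label-permutation test. This is an illustration under simplifying assumptions, not the procedure actually used: each individual is reduced to a hypothetical 0/1 flag indicating whether the haplotype of interest is carried, and the test statistic is a plain 2x2 chi-square.

```python
import random


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def permutation_p_value(case_flags, control_flags, iterations=100, seed=0):
    """Return (observed chi-square, fraction of permutations at least as extreme).

    case_flags / control_flags: lists of 0/1 carrier flags, one per individual
    (a simplification: real data would be unphased multi-marker genotypes).
    """
    def statistic(cases, controls):
        a, c = sum(cases), sum(controls)
        return chi_square_2x2(a, len(cases) - a, c, len(controls) - c)

    observed = statistic(case_flags, control_flags)
    pooled = list(case_flags) + list(control_flags)
    n_cases = len(case_flags)
    rng = random.Random(seed)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)  # random reallocation to two artificial groups
        if statistic(pooled[:n_cases], pooled[n_cases:]) >= observed:
            hits += 1
    return observed, hits / iterations
```

A small empirical fraction indicates that an association as strong as the observed one rarely arises when group membership is assigned at random.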

Thus, the data of Figure 13, together with an evaluation of the associations for single marker
alleles or for haplotypes, will permit estimation of the risk that a corresponding carrier has of
developing prostate cancer. It will be appreciated that significance thresholds of relative risks
will be more finely assessed according to the population tested.

Diagnostic techniques for determining an individual's risk of developing prostate
cancer may be implemented as described below for the markers in the maps of the present
invention, including the markers of SEQ ID Nos: 3915 to 3923 (markers 99-123-381, 4-26-29,
4-14-240, 4-77-151, 99-217-277, 4-67-40, 99-213-164, 99-221-377, and 99-135-196).

The above haplotype analysis indicated that 171 kb of genomic DNA between biallelic
markers 4-14-240 and 99-221-377 totally or partially contains a gene responsible for prostate
cancer. Therefore, the protein coding sequences lying within this region were characterized to
locate the gene associated with prostate cancer. This analysis, described in further detail
below, revealed a single protein coding sequence in the 171 kb genomic region, which was
designated as the PG1 gene.

Example 16 

Identification of the Genomic Sequence in the Candidate Region 
Template DNA for sequencing the PG1 gene was obtained as follows. BACs E and F
from Fig. 9 were subcloned as previously described. Plasmid inserts were first amplified by PCR
on PE 9600 thermocyclers (Perkin-Elmer), using appropriate primers, AmpliTaqGold
(Perkin-Elmer), dNTPs (Boehringer), buffer and cycling conditions as recommended by the
Perkin-Elmer Corporation.

PCR products were then sequenced using automatic ABI Prism 377 sequencers (Perkin
Elmer, Applied Biosystems Division, Foster City, CA). Sequencing reactions were performed
using PE 9600 thermocyclers (Perkin Elmer) with standard dye-primer chemistry and
ThermoSequenase (Amersham Life Science). The primers were labeled with the JOE, FAM,
ROX and TAMRA dyes. The dNTPs and ddNTPs used in the sequencing reactions were
purchased from Boehringer. Sequencing buffer, reagent concentrations and cycling conditions
were as recommended by Amersham.

Following the sequencing reaction, the samples were precipitated with EtOH,
resuspended in formamide loading buffer, and loaded on a standard 4% acrylamide gel.
Electrophoresis was performed for 2.5 hours at 3000 V on an ABI 377 sequencer, and the
sequence data were collected and analyzed using the ABI Prism DNA Sequencing Analysis
Software, version 2.1.2.

The sequence data obtained as described above were transferred to a proprietary
database, where quality control and validation steps were performed. A proprietary base-caller
flagged suspect peaks, taking into account the shape of the peaks, the inter-peak resolution, and
the noise level. The proprietary base-caller also performed an automatic trimming. Any stretch
of 25 or fewer bases having more than 4 suspect peaks was considered unreliable and was
discarded.
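
The trimming rule stated above (more than 4 suspect peaks within a stretch of at most 25 bases) can be sketched with a sliding window. This is only one possible reading of the rule: the proprietary base-caller is not described in further detail, and the function name and the choice to mark the entire offending window are assumptions.

```python
def unreliable_mask(suspect, window=25, max_suspect=4):
    """Flag bases that fall in any window with too many suspect peaks.

    suspect: list of 0/1 (or bool) flags, one per called base.
    Returns a list of bools; True marks a base lying inside at least one
    window of `window` bases that contains more than `max_suspect`
    suspect peaks (an assumed interpretation of the rule in the text).
    """
    n = len(suspect)
    mask = [False] * n
    if n == 0:
        return mask
    width = min(window, n)
    count = sum(suspect[:width])  # suspect peaks in the first window
    for start in range(n - width + 1):
        if start > 0:
            # slide the window one base to the right
            count += suspect[start + width - 1] - suspect[start - 1]
        if count > max_suspect:
            for i in range(start, start + width):
                mask[i] = True
    return mask
```

Bases flagged True would then be excluded before assembly.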

The sequence fragments from BAC subclones isolated as described above were
assembled using Gap4 software from R. Staden (Bonfield et al. 1995). This software allows
the reconstruction of a single sequence from sequence fragments. The sequence deduced from
the alignment of different fragments is called the consensus sequence. Directed sequencing
techniques (primer walking) were used to complete sequences and link contigs.

Potential functional sequences were then identified as described in Example 17. 

Example 17 
Identification of Functional Sequences 
Potential exons in BAC-derived human genomic sequences were located by homology
searches on protein, nucleic acid and EST (Expressed Sequence Tags) public databases. Main
public databases were locally reconstructed as mentioned in Example 9. The protein database,
NRPU (Non-redundant Protein Unique), is formed by a non-redundant fusion of the Genpept
(Benson et al., Nucleic Acids Res. 24:1-5 (1996)), Swissprot (Bairoch, A. and Apweiler, R.,
Nucleic Acids Res. 24:21-25 (1996)) and PIR/NBRF (George et al., Nucleic Acids Res. 24:17-20
(1996)) databases. Redundant data were eliminated by using the NRDB software (Benson et al.
(1996), supra) and internal repeats were masked with the XNU software (Benson et al., supra).
Homologies found using the NRPU database allowed the identification of sequences
corresponding to potential coding exons related to known proteins.

The EST local database is composed of the gbest section (1-9) of GenBank (Benson et
al. (1996), supra), and thus contains all publicly available transcript fragments. Homologies
found with this database allowed the localization of potentially transcribed regions.

The local nucleic acid database contained all sections of GenBank and EMBL
(Rodriguez-Tome et al., Nucleic Acids Res. 24:6-12 (1996)) except the EST sections. Redundant
data were eliminated as previously described.

Similarity searches in protein or nucleic acid databases were performed using the
BLAST software (Altschul et al., J. Mol. Biol. 215:403-410 (1990)). Alignments were refined
using the Fasta software, and multiple alignments used Clustal W. Homology thresholds were
adjusted for each analysis based on the length and the complexity of the tested region, as well as
on the size of the reference database.

Potential exon sequences identified as above were used as probes to screen cDNA
libraries. Extremities of positive clones were sequenced and the sequence stretches were
positioned on the genomic sequence determined above. Primers were then designed using the
results from these alignments in order to enable the cloning of cDNAs derived from the gene
associated with prostate cancer that was identified using the above procedures.

The obtained cDNA molecules were then sequenced, and results of Northern blot
analysis of prostate mRNAs supported the existence of a major cDNA having a 5-6 kb length.

The structure of the gene associated with prostate cancer was evaluated as described in
Example 18.

Example 18
Analysis of Gene Structure
The intron/exon structure of the gene was then fully deduced by aligning the
mRNA sequence from the cDNA obtained as described above and the genomic DNA sequence
obtained as described above. This alignment permitted the determination, in the genomic
sequence, of the positions of the introns and exons, the positions of the start and end
nucleotides defining each of the at least 8 exons, the locations and phases of the 5' and 3'
splice sites, the position of the stop codon, and the position of the polyadenylation site.
This analysis also yielded the positions of the coding region in the mRNA, and the locations
of the polyadenylation signal and polyA stretch in the mRNA.

The gene identified as described above comprises at least 8 exons and spans more
than 52 kb. A G/C rich putative promoter region was identified upstream of the coding
sequence. A CCAAT sequence in the putative promoter was also identified. The promoter region
was identified as described in Prestridge, D.S., Predicting Pol II Promoter Sequences Using
Transcription Factor Binding Sites, J. Mol. Biol. 249:923-932 (1995).

Additional analysis using conventional techniques, such as a 5' RACE reaction using
the Marathon-Ready human prostate cDNA kit from Clontech (Catalog No. PT1156-1), may
be performed to confirm that the 5' end of the cDNA obtained above is the authentic 5' end in
the mRNA.

Alternatively, the 5' sequence of the transcript can be determined by conducting a PCR
amplification with a series of primers extending from the 5' end of the identified coding
region.

Example 19

Detection of biallelic markers in the candidate gene: DNA extraction
Donors were unrelated and healthy. They were sufficiently diverse to be
representative of a heterogeneous French population. The DNA from 100 individuals was
extracted and tested for the detection of the biallelic markers.

30 ml of peripheral venous blood were taken from each donor in the presence of
EDTA. Cells (pellet) were collected after centrifugation for 10 minutes at 2000 rpm. Red cells
were lysed by a lysis solution (50 ml final volume: 10 mM Tris pH 7.6; 5 mM MgCl2; 10 mM
NaCl). The solution was centrifuged (10 minutes, 2000 rpm) as many times as necessary to
eliminate the residual red cells present in the supernatant, after resuspension of the pellet in
the lysis solution.

The pellet of white cells was lysed overnight at 42°C with 3.7 ml of lysis solution
composed of:

- 3 ml TE 10-2 (Tris-HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
- 200 µl SDS 10%
- 500 µl proteinase K (2 mg proteinase K in TE 10-2 / NaCl 0.4 M).

For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) was added. After
vigorous agitation, the solution was centrifuged for 20 minutes at 10000 rpm.

For the precipitation of DNA, 2 to 3 volumes of 100% ethanol were added to the previous
supernatant, and the solution was centrifuged for 30 minutes at 2000 rpm. The DNA solution
was rinsed three times with 70% ethanol to eliminate salts, and centrifuged for 20 minutes at
2000 rpm. The pellet was dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml water. The
DNA concentration was evaluated by measuring the OD at 260 nm (1 unit OD = 50 µg/ml
DNA).

To determine the presence of proteins in the DNA solution, the OD 260 / OD 280
ratio was determined. Only DNA preparations having an OD 260 / OD 280 ratio between 1.8
and 2 were used in the subsequent examples described below.
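
The two spectrophotometric checks above reduce to simple arithmetic, sketched here. The sketch assumes the conversion of 50 µg/ml of double-stranded DNA per OD unit at 260 nm (the value given in Example 27); the function names and the `dilution_factor` parameter are illustrative and do not come from the text.

```python
def dna_concentration_ug_per_ml(od260, dilution_factor=1.0):
    """Estimate DNA concentration from absorbance at 260 nm.

    Assumes 1 OD unit = 50 ug/ml of double-stranded DNA.
    dilution_factor is hypothetical: how many fold the sample was
    diluted before the reading was taken.
    """
    return od260 * 50.0 * dilution_factor


def passes_purity(od260, od280, low=1.8, high=2.0):
    """Accept a preparation only if OD260/OD280 lies between 1.8 and 2."""
    if od280 <= 0:
        return False
    return low <= od260 / od280 <= high
```

For example, a reading of 0.5 on a 20-fold dilution corresponds to about 500 µg/ml under this assumption.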

The pool was constituted by mixing equivalent quantities of DNA from each
individual.

Example 20

Detection of the biallelic markers: amplification of genomic DNA by PCR

The amplification of specific genomic sequences of the DNA samples of Example 19
was carried out on the pool of DNA obtained previously using the amplification primers of
SEQ ID Nos: 7861 to 7865 and 11792 to 11796. In addition, 50 individual samples were
similarly amplified.

PCR assays were performed using the following protocol:

Final volume                                           25 µl
DNA                                                    2 ng/µl
MgCl2                                                  2 mM
dNTP (each)                                            200 µM
primer (each)                                          2.9 ng/µl
Ampli Taq Gold DNA polymerase                          0.05 unit/µl
PCR buffer (10x = 0.1 M TrisHCl pH 8.3, 0.5 M KCl)     1x

Pairs of first primers were designed to amplify the promoter region, exons, and 3' end
of the candidate asthma-associated gene using the sequence information of the candidate gene
and the OSP software (Hillier & Green, 1991). These first primers were about 20 nucleotides
in length and contained a common oligonucleotide tail upstream of the specific bases targeted
for amplification which was useful for sequencing. The synthesis of these primers was
performed following the phosphoramidite method, on a GENSET UFPS 24.1 synthesizer.

DNA amplification was performed on a Genius II thermocycler. After heating at 94°C
for 10 min, 40 cycles were performed. Each cycle comprised: 30 sec at 94°C, 1 min at 55°C,
and 30 sec at 72°C. A final elongation of 7 min at 72°C ended the amplification. The quantities
of the amplification products obtained were determined on 96-well microtiter plates, using a
fluorometer and Picogreen as intercalating agent (Molecular Probes).
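
As a quick check on the cycling program above, the programmed hold time can be totalled as follows. The helper name is illustrative, and ramp times between temperatures, which the text does not give, are ignored.

```python
def program_seconds(initial_s, cycles, steps_s, final_s):
    """Total programmed hold time of a PCR run, excluding temperature ramps."""
    return initial_s + cycles * sum(steps_s) + final_s


# 10 min at 94 C, then 40 cycles of (30 s, 60 s, 30 s), then 7 min at 72 C
total = program_seconds(10 * 60, 40, (30, 60, 30), 7 * 60)
total_minutes = total / 60
```

Under these assumptions the holds add up to 5820 seconds, that is, 97 minutes.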

Example 21
Detection of the biallelic markers
Sequencing of amplified genomic DNA and identification of polymorphisms
The sequencing of the amplified DNA obtained in Example 20 was carried out on ABI
377 sequencers. The sequences of the amplification products were determined using
automated dideoxy terminator sequencing reactions with a dye terminator cycle sequencing
protocol. The products of the sequencing reactions were run on sequencing gels and the
sequences were analyzed as described previously.

The sequence data were further evaluated using the above mentioned polymorphism
analysis software designed to detect the presence of biallelic markers among the pooled
amplified fragments. The polymorphism search was based on the presence of superimposed
peaks in the electrophoresis pattern resulting from different bases occurring at the same
position as described previously.

Six amplification fragments were analyzed. In these fragments, 8 biallelic markers
were detected (SEQ ID Nos: 3927 to 3934). The localization of the biallelic markers, the
polymorphic bases of each allele, and the frequencies of the most frequent alleles are
shown in Table 5.



Table 5

Amplicon   Marker Name   Origin of DNA   Localization in gene   Polymorphism   Frequency (%)
1          10-204-326    Ind.            Promoter               A/G            96.2 (G)
2          10-32-357     Pool            Intron 1               A/C            67.7 (C)
3          10-33-175     Ind.            Exon 2                 C/T            97.3 (C)
3          10-33-234     Pool            Intron 2               A/C            56.7 (C)
3          10-33-327     Ind.            Intron 2               C/T            75.3 (T)
5          10-35-358     Pool            Intron 4               C/G            67.9 (G)
5          10-35-390     Ind.            Intron 4               C/T            82 (C)
6          10-36-164     Ind.            Exon 5                 A/G            99.5 (G)


Allelic frequencies were determined in a population of random blood donors of French
Caucasian origin. Their wide range is due to the fact that, besides screening a pool of 100
individuals to generate biallelic markers as described above, polymorphism searches were also
conducted in an individual testing format for 50 samples. This strategy was chosen here to
provide a potential shortcut towards the identification of putative causal mutations in the
association studies using them. As the 10-36-164 biallelic marker (SEQ ID No: 3933) was
found in only one individual, this marker was not considered in the association studies.

The fourth amplification fragment, carrying exon 3 (not shown in the Table), was
not polymorphic in the tested samples (1 pool + 50 individuals).

Example 22

Validation of the polymorphisms through microsequencing
The biallelic markers identified in Example 21 were further confirmed and their
respective frequencies were determined through microsequencing. Microsequencing was
carried out for each individual DNA sample described in Example 19.

Amplification from genomic DNA of individuals was performed by PCR as described
above for the detection of the biallelic markers with the same set of PCR primers described
above.

The preferred primers used in microsequencing were about 19 nucleotides in length and
hybridized just upstream of the considered polymorphic base.

Five primers hybridized with the non-coding strand of the gene. For the biallelic markers
10-204-326, 10-35-358 and 10-36-164, primers hybridized with the coding strand of the gene.
The microsequencing reaction was performed as described in Example 13.

Example 23

Association study between asthma and the biallelic markers of the candidate gene
Collection of DNA samples from affected and non-affected individuals
The asthmatic population used to perform association studies in order to establish
whether the candidate gene was an asthma-causing gene consisted of 298 individuals. More
than 90% of these 298 asthmatic individuals had a Caucasian ethnic background.

The control population consisted of 373 unaffected individuals, of whom 279 were
French (at least 70% were of Caucasian origin) and 94 were American (at least 90% were of
Caucasian origin).

DNA samples were obtained from asthmatic and non-asthmatic individuals as
described above.

Example 24

Association study between asthma and the biallelic markers of the candidate gene
Genotyping of affected and control individuals
The general strategy to perform the association studies was to individually scan the
DNA samples from all individuals in each of the populations described above in order to
establish the allele frequencies of the above described biallelic markers in each of these
populations.

Allelic frequencies of the above-described biallelic markers in each population were
determined by performing microsequencing reactions on amplified fragments obtained by
genomic PCR performed on the DNA samples from each individual. Genomic PCR and
microsequencing were performed as detailed above in Examples 20 and 22 using the described
amplification and microsequencing primers.
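
Once each individual has been genotyped, the allele frequency in a population follows from the genotype counts. The sketch below shows the usual gene-counting estimate and its binomial standard error; the function name and the example counts are hypothetical and are not data from the study.

```python
import math


def allele_frequency(n_hom_a, n_het, n_hom_b):
    """Frequency of allele A and its standard error from genotype counts.

    n_hom_a: individuals homozygous for A
    n_het:   heterozygous individuals
    n_hom_b: individuals homozygous for the other allele
    Each individual contributes two alleles, hence the factor 2.
    """
    n = n_hom_a + n_het + n_hom_b
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_hom_a + n_het) / (2 * n)
    se = math.sqrt(p * (1 - p) / (2 * n))  # binomial standard error on 2n alleles
    return p, se


# Hypothetical counts: 30 AA, 40 AB, 30 BB
p, se = allele_frequency(30, 40, 30)
```

The standard error shrinks with the square root of the number of alleles sampled, which is why the case and control group sizes matter for the comparisons that follow.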

Example 25 

Association study between asthma and the biallelic markers of the candidate gene 
Table 6 shows the results of the association study between five biallelic markers in the 
candidate gene and asthma. 



Table 6

                 Allelic frequencies (%)
Markers          Asthmatics            Controls              Frequency diff.   P value
                 (298 individuals)     (373 individuals)
10-32-357        A 38.6                A 29.8                8.8               7.34x10^-4
10-33-234        A 49                  A 44.3                4.7               8.86x10^-2
10-33-327        T 78.5                T 74.6                3.9               1.0x10^-1
10-35-358        G 72.3                G 66.9                5.4               3.59x10^-2
10-35-390        T 30.4                T 20.3                10.1              2.33x10^-5




As shown in Table 6, markers 10-32-357 and 10-35-390 presented a strong association with
asthma, this association being highly significant (p value = 7.34x10^-4 for marker 10-32-357
and 2.33x10^-5 for marker 10-35-390).
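
The single-marker comparison can be illustrated by a chi-square test (one degree of freedom) on allele counts. The sketch below rebuilds approximate allele counts from the percentages in Table 6, assuming that every individual was genotyped (two alleles each) and that no continuity correction was applied; neither assumption is stated in the text. The function name is illustrative.

```python
import math


def allele_chi_square(freq_cases, n_cases, freq_controls, n_controls):
    """Chi-square (1 df) comparing an allele's frequency in cases and controls.

    Frequencies are fractions (0.386 for 38.6%); each individual is
    assumed to contribute two alleles. Returns (chi-square, p value).
    """
    a = freq_cases * 2 * n_cases          # allele of interest, cases
    b = 2 * n_cases - a                   # other allele, cases
    c = freq_controls * 2 * n_controls    # allele of interest, controls
    d = 2 * n_controls - c                # other allele, controls
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value


chi2_357, p_357 = allele_chi_square(0.386, 298, 0.298, 373)  # marker 10-32-357
chi2_390, p_390 = allele_chi_square(0.304, 298, 0.203, 373)  # marker 10-35-390
```

Under these assumptions the p values come out at roughly 7x10^-4 and 2x10^-5, of the same order as the values listed in Table 6; small differences are expected because the counts are rebuilt from rounded percentages.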

Three markers showed moderate association when tested independently, namely
10-33-234, 10-33-327 and 10-35-358.

It is worth mentioning that allelic frequencies for each of the biallelic markers of
Table 7 were separately measured within the French control population (279 individuals) and
the American control population (94 individuals). The differences in allele frequencies
between the two populations were between 1% and 7%, with p-values above 10^[exponent illegible]. These data
confirmed that the combined French/American control population (373 individuals) was
homogeneous enough to be used as a control population for the present association study.

Example 26

Association Studies: Haplotype frequency analysis

As already shown, one way of increasing the statistical power of individual markers
is to perform haplotype association analysis. A haplotype analysis for association of
markers in the candidate gene and asthma was performed by estimating the frequencies of all
possible haplotypes for biallelic markers 10-32-357, 10-33-234, 10-33-327, 10-35-358 and
10-35-390 in the asthmatic and control populations described in Example 25 (Table 6), and
comparing these frequencies by means of a chi square statistical test (one degree of freedom).
Haplotype estimations were performed by applying the Expectation-Maximization (EM)
algorithm (Excoffier L & Slatkin M, 1995, Mol. Biol. Evol. 12:921-927), using the EM-HAPLO
program (Hawley ME, Pakstis AJ & Kidd KK, 1994, Am. J. Phys. Anthropol. 18:104).

The results of such haplotype analysis are shown in Table 7.
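
The principle of EM haplotype frequency estimation can be shown on the smallest case, two biallelic markers, where only double heterozygotes have ambiguous phase. The sketch below is a simplified illustration and does not reproduce the EM-HAPLO program; the genotype encoding and function name are assumptions made for the example.

```python
def em_two_locus(genotypes, iterations=50):
    """Estimate two-marker haplotype frequencies from unphased genotypes.

    genotypes: list of (g1, g2), where g1 and g2 are the number of copies
    (0, 1 or 2) of allele '1' at marker 1 and marker 2.
    Returns a dict mapping haplotype (allele at marker 1, allele at
    marker 2), coded 0/1, to its estimated frequency.
    """
    if not genotypes:
        raise ValueError("no genotypes supplied")
    freqs = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    n_haplotypes = 2 * len(genotypes)
    for _ in range(iterations):
        counts = dict.fromkeys(freqs, 0.0)
        for g1, g2 in genotypes:
            if g1 == 1 and g2 == 1:
                # E step: double heterozygote is either 00/11 or 01/10;
                # weight each phase by the current haplotype frequencies.
                cis = freqs[(0, 0)] * freqs[(1, 1)]
                trans = freqs[(0, 1)] * freqs[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1 - w
                counts[(1, 0)] += 1 - w
            else:
                # Phase is unambiguous: read the two haplotypes directly.
                first = (1 if g1 == 2 else 0, 1 if g2 == 2 else 0)
                second = (1 if g1 >= 1 else 0, 1 if g2 >= 1 else 0)
                counts[first] += 1
                counts[second] += 1
        # M step: new frequencies are the expected haplotype proportions.
        freqs = {h: c / n_haplotypes for h, c in counts.items()}
    return freqs
```

With five markers the same idea applies, but each multiply heterozygous genotype is spread over all haplotype pairs compatible with it.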

Table 7

Markers               10-32-357    10-33-234    10-33-327    10-35-358    10-35-390    Asthm.   Controls   Odds ratio   P value
Frequency diff. (%)   8.8          4.7          3.9          5.4          10.1
P value               7.34x10^-4   8.86x10^-2   1.0x10^-1    3.59x10^-2   2.33x10^-5

Haplotype frequencies
Haplotype 1           A                                                   T            0.2      0.11       2.02         8.47x10^-6
Haplotype 2                        A            T            G                         0.27     0.18       1.68         2.81x10^-4
Haplotype 3           A            A            T            G            T            0.18     0.09       2.22         3.95x10^-5




A two-marker haplotype covering markers 10-32-357 and 10-35-390 (haplotype 1, AT
alleles respectively) presented a p value of 8.47x10^-6, an odds ratio of 2.02 and haplotype
frequencies of 0.2 for asthmatic and 0.11 for control populations respectively.

A three-marker haplotype covering markers 10-33-234, 10-33-327 and 10-35-358
(haplotype 2, ATG alleles respectively) presented a p value of 2.81x10^-4, an odds ratio of 1.68
and haplotype frequencies of 0.27 for asthmatic and 0.18 for control populations respectively.

A five-marker haplotype covering markers 10-32-357, 10-33-234, 10-33-327, 10-35-358
and 10-35-390 (haplotype 3, AATGT alleles respectively) presented a p value of 3.95x10^-5,
an odds ratio of 2.22 and haplotype frequencies of 0.18 for asthmatic and 0.09 for control
populations respectively.
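
The odds ratios quoted above are consistent with the usual definition computed from the two haplotype frequencies, as the following short sketch shows. The function name is illustrative; the text does not state the formula explicitly.

```python
def odds_ratio(freq_cases, freq_controls):
    """Odds of carrying the haplotype in cases divided by the odds in controls."""
    odds_cases = freq_cases / (1 - freq_cases)
    odds_controls = freq_controls / (1 - freq_controls)
    return odds_cases / odds_controls


or_h1 = odds_ratio(0.20, 0.11)  # haplotype 1
or_h2 = odds_ratio(0.27, 0.18)  # haplotype 2
or_h3 = odds_ratio(0.18, 0.09)  # haplotype 3
```

Rounded to two decimals, these reproduce the reported values of 2.02, 1.68 and 2.22.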

Haplotype association analysis thus increased the statistical power of the individual
marker association studies when compared to single-marker analysis (from p values between
10^-1 and 2x10^-5 for the individual markers to p values between 3x10^-4 and 8x10^-6 for the
haplotypes of Table 7).

The significance of the values obtained for the haplotype association analysis was
evaluated by the following computer simulation test. The genotype data from the asthmatic
and control individuals were pooled and randomly allocated to two groups which contained
the same number of individuals as the trait-positive and trait-negative groups used to produce
the data summarized in Table 7. A haplotype analysis was then run on these artificial groups
for the three haplotypes presented in Table 7. This experiment was reiterated 1000 times and
the results are shown in Table 8.

Table 8

                                     Permutation Test
Haplotype             Chi-Square     Average Chi-Square   Maximal Chi-Square   P value
Haplotype 1 (A-T)     19.70          1.2                  11.6                 1.0x10^-3
Haplotype 2 (-ATG-)   13.49          1.2                  10.5                 1.0x10^-3
Haplotype 3 (AATGT)   16.66          1.2                  9.3                  1.0x10^-3




The results in Table 8 show that among 1000 iterations only 1‰ of the obtained
haplotypes has a p value comparable to the one obtained in Table 7.

These results clearly validate the statistical significance of the haplotypes obtained
(haplotypes 1, 2 and 3, Table 7).

Example 27 
Extraction of DNA 

30 ml of blood are taken from the individuals in the presence of EDTA. Cells (pellet)
are collected after centrifugation for 10 minutes at 2000 rpm. Red cells are lysed by a lysis
solution (50 ml final volume: 10 mM Tris pH 7.6; 5 mM MgCl2; 10 mM NaCl). The solution
is centrifuged (10 minutes, 2000 rpm) as many times as necessary to eliminate the residual red
cells present in the supernatant, after resuspension of the pellet in the lysis solution.

The pellet of white cells is lysed overnight at 42°C with 3.7 ml of lysis solution
composed of:

- 3 ml TE 10-2 (Tris-HCl 10 mM, EDTA 2 mM) / NaCl 0.4 M
- 200 µl SDS 10%
- 500 µl proteinase K (2 mg proteinase K in TE 10-2 / NaCl 0.4 M).

For the extraction of proteins, 1 ml saturated NaCl (6M) (1/3.5 v/v) is added. After
vigorous agitation, the solution is centrifuged for 20 minutes at 10000 rpm.

For the precipitation of DNA, 2 to 3 volumes of 100% ethanol are added to the previous
supernatant, and the solution is centrifuged for 30 minutes at 2000 rpm. The DNA solution is
rinsed three times with 70% ethanol to eliminate salts, and centrifuged for 20 minutes at 2000
rpm. The pellet is dried at 37°C, and resuspended in 1 ml TE 10-1 or 1 ml water. The DNA
concentration is evaluated by measuring the OD at 260 nm (1 unit OD = 50 µg/ml DNA).

To evaluate the presence of proteins in the DNA solution, the OD 260 / OD 280 ratio
is determined. Only DNA preparations having an OD 260 / OD 280 ratio between 1.8 and 2 are
used in the subsequent steps described below.

Once genomic DNA from every individual in the given population has been extracted,
it is preferred that a fraction of each DNA sample is separated, after which a pool of DNA is
constituted by assembling equivalent DNA amounts of the separated fractions into a single
one.

TABLE 1



SEQ ID 
No. 



1 



Marker Name 



99-109-224 



99-1126-384 



99-114-68 



Allele 



rsr- 



_C_ 

c 



Preferred 
microseq. 
primer 



S 
S 



Amplification primer 



Upstream 
(PU) 



3935 



3936 



3937 



Downstream 
(RP) 



7866 



7867 



7868 



4 



99-1151-516 



99-1165-159, 



99-1167-201 
99-117-205 



A 
C 



G 
T 



_S_ 
S 



A 
S 



3939 



7870 



3940 
3941 



7871 
7872 



10 



11 

12 
13 



14 



15 
16 



17 
18 



19 
20 
21 



23 



24 



25 



26 



27 
28 



29 



30 



31 



32 



33 
34 



35 



36 



37 



38 



39 



40 



41 



42 



99-118-92 



99-1217-332 



99-1233-183 



99-12478-263 



99-12487-301 



99-12497-155 



99-12503-44 



99-12504-402 



99-12505-374 



99-12509-423 



99-12513-146 



99-12514-170 
99-12515-205 



99-12516-524 



99-12518-325 
99-12523-255 



99-12525-277 



G 
G 



G 
T 



C 
C 



G 



99-12526-317 



99-12527-292 
99-12531-30 



99-12532-199 



99-12534-207 



9942535-362 



99-12537-340 



99-12538-142 



99-12539-287 



99-12540-426 



99-12541-307 



99-12545-121 



99-12548-88 



99-12558-167 



99-12562-291 



99-12564-354 



99-12565-273 



99-12575-248 
99-12576-325 
99-12580-268 
99-12585-85 
99-12593-103 



A 
C 



T 
T 



G 
T 



T 
C 



S 



A 
S 



A 
S 



c 



s 



3943 



3944 



7874 
7875 



3945 
3946 



7876 



7877 



3947 



7878 



3948 
3949 



7879 



7880 



3950 



7881 



3951 



7882 



3952 



3953 



7884 



3954 
3955 



7885 
7886 



3956 



7887 



3957 



7888 



A 
A 



S 



s 
s 



3958 



7889 



3959 



7890 



3960 



3961 
3962 



3963 



3964 



3965 



3966 



3967 



3969 



3970 



3971 



3972 



3973 



3974 



3975 



3976 



7891 



7892 
7893 



7894 



7895 



7896 



7897 
7898 



7899 



7901 



7902 



7903 



7904 
7905 



7906 



7907 



3982 
3983 



7913 
7914 



48 
49 



99-12608-71 
T
3984


7915 


CI 


00 19^11-111 1 


G 
3985


7916 


j J. 


00 19611-166 
A


3986 


7917 




00-15615-235 


A 


c 


S 


3987 


7918 


>4 


00 19617-412 I 
S


398S 


7919 


c< 
jj 


00.12618-21 1 


c 


T 
3989


7920 


<A 
yy- 1 zo 1 7'ju '
3990


7921 




00 12621-114 
3991
3992


7923 


59 
3993
3994


7925 


01 
3995

WHAT IS CLAIMED IS:

1. An isolated or purified polynucleotide comprising a contiguous span of at least 12
nucleotides of a sequence selected from the group consisting of SEQ ID No. 1 to 2260, and the
complements thereof.

2. A polynucleotide according to claim 1, wherein said span comprises a map-related 
biallelic marker. 

3. An isolated or purified polynucleotide consisting essentially of a contiguous span of 
at least 8 to 43 nucleotides of a sequence selected from the group consisting of SEQ ID No. 
2261 to 3734, 3735 to 3908, and the complements thereof. 

4. A polynucleotide according to claim 3, wherein said span comprises a map-related
biallelic marker.

5. An isolated or purified polynucleotide comprising a contiguous span of at least 12
nucleotides of a sequence selected from the group consisting of SEQ ID No. 2261 to 3734,
and the complements thereof, wherein said span comprises a map-related biallelic marker and
the 1st allele indicated in Table 1 is present at said map-related biallelic marker.

6. A polynucleotide according to any one of claims 2, 4, and 5, wherein said 
contiguous span is 18 to 35 nucleotides in length and said biallelic marker is within 4 
nucleotides of the center of said polynucleotide. 

7. A polynucleotide according to claim 6, wherein said polynucleotide consists
essentially of said contiguous span and said contiguous span is 25 nucleotides in length and
said biallelic marker is at the center of said polynucleotide.

8. An isolated or purified polynucleotide comprising a contiguous span of at least 12
nucleotides of a sequence selected from the group consisting of SEQ ID No. 3935 to 6194,
7866 to 10125, and the complements thereof.

9. An isolated or purified polynucleotide consisting essentially of a contiguous span of
at least 8 to 43 nucleotides of a sequence selected from the group consisting of SEQ ID No.
6195 to 7668, 7669 to 7842, 10126 to 11599, 11600 to 11773, and the complements thereof.
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10. A polynucleotide according to any one of claims 1, 3, 8, and 9, wherein the 3' end 
of said contiguous span is present at the 3* end of said polynucleotide. 

5 1 1. A polynucleotide according to any one of claims 2, 3, and 5, wherein the V end of 

said contiguous span is located at the 3' end of said polynucleotide and said biallelic marker is 
present at the 3* end of said polynucleotide. 

12. A polynucleotide according to either of claims 1 and 3, wherein the 3' end of said
contiguous span is present at the 3' end of said polynucleotide and the 3' end of said
polynucleotide is located within 10 nucleotides upstream of a map-related biallelic marker in
said sequence.

13. A polynucleotide according to claim 12, wherein the 3' end of said polynucleotide
is located 1 nucleotide upstream of a map-related biallelic marker in said sequence.

14. A polynucleotide according to claim 13, wherein said contiguous span is 19 
nucleotides in length and said polynucleotide consists of said contiguous span. 
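
The primer geometry of claims 12 to 14 (a 19-nucleotide span whose 3' end lies 1 nucleotide upstream of the marker) can be sketched as follows. This is only an illustration under stated assumptions: the sequence is a plain string read 5' to 3', the marker position is a 0-based index, and the name `microsequencing_primer` is hypothetical.

```python
def microsequencing_primer(sequence: str, marker_index: int, length: int = 19) -> str:
    """Return the `length` nucleotides immediately upstream of the marker.

    The last base of the returned primer (its 3' end) is the base at
    `marker_index - 1`, i.e. 1 nucleotide upstream of the polymorphic base.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    start = marker_index - length
    if start < 0 or marker_index >= len(sequence):
        raise ValueError("marker index out of range or too little upstream sequence")
    return sequence[start:marker_index]


if __name__ == "__main__":
    upstream = "ACGTACGTACGTACGTACG"   # 19 nucleotides
    seq = upstream + "T" + "AAAA"      # marker base at index 19
    print(microsequencing_primer(seq, 19))
```

The marker base itself is deliberately excluded, since in this layout it would be the first base incorporated by extension from the primer.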

15. A polynucleotide according to any one of claims 1, 3, 5, 8, and 9 wherein said
contiguous span comprises at least 21 contiguous nucleotides.

16. A polynucleotide according to any one of claims 1, 3, and 5, wherein said 
contiguous span comprises at least 30 contiguous nucleotides. 

17. A polynucleotide according to any one of claims 1, 3, and 5, wherein said 
contiguous span comprises at least 43 contiguous nucleotides. 

18. A polynucleotide for use in determining the identity of nucleotides at a map-related
biallelic marker, wherein said determining is performed in a hybridization assay,
sequencing assay, microsequencing assay, or an enzyme-based mismatch detection assay.

19. A polynucleotide for use in amplifying a segment of nucleotides comprising a
map-related biallelic marker.

20. A polynucleotide according to either of claims 18 and 19, wherein said map-related
biallelic marker is selected from the group consisting of the biallelic markers of SEQ
ID Nos. 1 to 3908, and the complements thereto.

21. A polynucleotide according to either of claims 18 and 19, wherein said map-related
biallelic marker is selected from the group consisting of the biallelic markers of SEQ ID Nos.
1 to 2260, 2261 to 3734, and the complements thereto.

22. A polynucleotide according to any one of claims 1, 3, 5, 8, 9, 18, and 19 attached
to a solid support.

23. An array of polynucleotides comprising at least one polynucleotide according to 
claim 22. 

24. An array according to claim 23, wherein said array is addressable.

25. A polynucleotide according to any one of claims 1, 3, 5, 7, 8, 9, 14, 18, and 19, 
further comprising a label. 

26. A map of the human genome comprising an ordered array of biallelic markers,
wherein at least 1 of said biallelic markers is a map-related biallelic marker.

27. A map according to claim 26, comprising all of the biallelic markers of SEQ ID
Nos. 1 to 3908, and the complements thereto.

28. A method of genotyping comprising determining the identity of a nucleotide at a 
map-related biallelic marker in a biological sample. 

29. A method according to claim 28, wherein said map-related biallelic marker is
selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, and the
complements thereto.

30. A method according to claim 28, wherein said map-related biallelic marker is
selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 2260, 2261 to
3734, and the complements thereto.

31. A method according to claim 28, wherein said biological sample is derived from a
single subject.

32. A method according to claim 31, wherein the identity of the nucleotides at said
biallelic marker is determined for both copies of said biallelic marker present in said subject's
genome.

33. A method according to claim 28, wherein said biological sample is derived from
multiple subjects.

34. A method according to claim 28, further comprising amplifying a portion of said 
sequence comprising the biallelic marker prior to said determining step. 

35. A method according to claim 34, wherein said amplifying is performed by PCR. 

36. A method according to claim 28, wherein said determining is performed by a 
hybridization assay, a sequencing assay, a microsequencing assay, or an enzyme-based 
mismatch detection assay. 

37. A method of determining the frequency in a population of an allele of a map-related
biallelic marker, comprising:

a) genotyping individuals from said population for said biallelic marker according to
the method of claim 28; and

b) determining the proportional representation of said biallelic marker in said
population.
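
Step b) of claim 37 amounts to counting alleles across genotyped individuals. The sketch below shows one way to do that when each individual has been genotyped separately. It assumes diploid genotypes written as two-character strings such as "AG"; the function name `allele_frequencies` and this encoding are illustrative assumptions, not part of the claimed method.

```python
from collections import Counter
from typing import Dict, Iterable


def allele_frequencies(genotypes: Iterable[str]) -> Dict[str, float]:
    """Return the proportion of each allele among all chromosomes sampled.

    Each genotype is a 2-character string (one character per chromosome copy),
    which is an encoding assumed for this sketch.
    """
    counts: Counter = Counter()
    for genotype in genotypes:
        if len(genotype) != 2:
            raise ValueError(f"expected a diploid genotype, got {genotype!r}")
        counts.update(genotype.upper())  # adds one count per allele copy
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no genotypes supplied")
    return {allele: n / total for allele, n in counts.items()}


if __name__ == "__main__":
    print(allele_frequencies(["AA", "AG", "GG", "AG"]))
```

Because every individual contributes two allele copies, the denominator is twice the number of individuals genotyped.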

38. A method according to claim 37, wherein said map-related biallelic marker is 
selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, and the 
complements thereto. 

39. A method according to claim 37, wherein said map-related biallelic marker is 
selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 2260, 2261 to 
3734, and the complements thereto. 



40. A method according to claim 37, wherein said genotyping of step a) is performed
on each individual of said population.

41. A method according to claim 37, wherein said genotyping is performed on a single
biological sample derived from said population.

42. A method of detecting an association between an allele and a phenotype,
comprising the steps of:

a) determining the frequency of at least one map-related biallelic marker allele in a trait
positive population according to the method of claim 37;

b) determining the frequency of said map-related biallelic marker allele in a control
population according to the method of claim 37; and

c) determining whether a statistically significant association exists between said allele
and said phenotype.
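
Claim 42 does not name a particular statistical test for step c). As one possible illustration, the sketch below applies a Pearson chi-square test (1 degree of freedom, no continuity correction) to a 2x2 table of allele counts in the trait positive and control populations. The choice of test, the function name `allele_association`, and the count-based inputs are all assumptions of this example.

```python
import math
from typing import Tuple


def allele_association(case_a: int, case_b: int,
                       control_a: int, control_b: int) -> Tuple[float, float]:
    """Return (chi-square statistic, p-value) for a 2x2 allele-count table.

    Rows are populations (trait positive, control); columns are the two
    alleles of the biallelic marker.
    """
    n = case_a + case_b + control_a + control_b
    row_case, row_control = case_a + case_b, control_a + control_b
    col_a, col_b = case_a + control_a, case_b + control_b
    if 0 in (row_case, row_control, col_a, col_b):
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (case_a * control_b - case_b * control_a) ** 2 / (
        row_case * row_control * col_a * col_b
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = allele_association(60, 40, 40, 60)
    print(f"chi2={chi2:.3f}, p={p:.4f}")
```

A small p-value suggests that the allele frequencies differ between the two populations; the significance threshold is left to the user.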

43. A method of estimating the frequency of a haplotype for a set of biallelic markers
in a population, comprising:

a) genotyping each individual in said population for at least one map-related biallelic
marker according to claim 31;

b) genotyping each individual in said population for a second biallelic marker by
determining the identity of the nucleotides at said second biallelic marker for both copies of
said second biallelic marker present in the genome; and

c) applying a haplotype determination method to the identities of the nucleotides
determined in steps a) and b) to obtain an estimate of said frequency.

44. A method according to claim 43, wherein said haplotype determination method is
selected from the group consisting of asymmetric PCR amplification, double PCR
amplification of specific alleles, the Clark method, or an expectation maximization algorithm.
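
Of the haplotype determination methods listed in claim 44, the expectation maximization algorithm lends itself to a short sketch. The example below handles the simplest case of two biallelic markers, where only double heterozygotes have ambiguous phase. It assumes each genotype is given as a pair of counts (0, 1, or 2) of an arbitrarily chosen allele "1" at each marker; the function name `em_haplotype_frequencies`, the encoding, and the fixed iteration count are assumptions of this illustration.

```python
from typing import Dict, List, Tuple

Haplotype = Tuple[int, int]


def em_haplotype_frequencies(genotypes: List[Tuple[int, int]],
                             iterations: int = 100) -> Dict[Haplotype, float]:
    """Estimate two-marker haplotype frequencies from unphased genotypes."""
    if not genotypes:
        raise ValueError("no genotypes supplied")
    haplotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    freq = {h: 0.25 for h in haplotypes}  # start from equal frequencies
    n = len(genotypes)
    for _ in range(iterations):
        counts = {h: 0.0 for h in haplotypes}
        for a, b in genotypes:
            if a == 1 and b == 1:
                # E-step: phase is ambiguous, either (00 + 11) or (01 + 10).
                cis = freq[(0, 0)] * freq[(1, 1)]
                trans = freq[(0, 1)] * freq[(1, 0)]
                total = cis + trans
                w = cis / total if total > 0 else 0.5
                counts[(0, 0)] += w
                counts[(1, 1)] += w
                counts[(0, 1)] += 1.0 - w
                counts[(1, 0)] += 1.0 - w
            else:
                # Phase is fully determined when at most one marker is heterozygous.
                a1, a2 = (1 if a >= 1 else 0), (1 if a == 2 else 0)
                b1, b2 = (1 if b >= 1 else 0), (1 if b == 2 else 0)
                counts[(a1, b1)] += 1.0
                counts[(a2, b2)] += 1.0
        # M-step: re-estimate frequencies from expected haplotype counts.
        freq = {h: c / (2 * n) for h, c in counts.items()}
    return freq


if __name__ == "__main__":
    print(em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)]))
```

A production implementation would normally stop on a convergence criterion rather than a fixed number of iterations, and would generalize to more than two markers.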

45. A method according to claim 43, wherein said map-related biallelic marker is
selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 3908, and the
complements thereto.

46. A method according to claim 43, wherein said map-related biallelic marker is
selected from the group consisting of the biallelic markers of SEQ ID Nos. 1 to 2260, 2261 to
3734, and the complements thereto.

47. A method of detecting an association between a haplotype and a phenotype,
comprising the steps of:

a) estimating the frequency of at least one haplotype in a trait positive population
according to the method of claim 43;

b) estimating the frequency of said haplotype in a control population according to the
method of claim 43; and

c) determining whether a statistically significant association exists between said
haplotype and said phenotype.

48. A method according to either claim 42 or 47, wherein said control population is a
trait negative population.

49. A method according to either claim 42 or 47, wherein said case control population 
is a random population. 

50. A method according to claim 42, wherein each of said genotyping of steps a) and 
b) is performed on a single pooled biological sample derived from each of said populations. 

51. A method according to claim 42, wherein said genotyping of steps a) and b) is
performed separately on biological samples derived from each individual in said populations.

52. A method according to either claim 42 or 47, wherein said phenotype is selected
from the group consisting of disease, drug response, drug efficacy, treatment response,
treatment efficacy, and drug toxicity.

53. A method according to claim 42, wherein the identity of the nucleotides at all of 
the biallelic markers of SEQ ID Nos. 1 to 3908 is determined in steps a) and b). 

54. A computer readable medium having stored thereon the sequence of the
polynucleotide according to any one of the claims selected from the group consisting of 1, 3,
5, 7, 8, 9 and 14.

55. A computer system comprising a processor and a data storage device wherein said
data storage device has stored thereon the sequence of the polynucleotide according to any one
of the claims selected from the group consisting of 1, 3, 5, 7, 8, 9 and 14.

56. The computer system of Claim 55, further comprising a sequence comparer and a
data storage device having reference sequences stored thereon.

57. A method for comparing a first sequence to a reference sequence, comprising the
steps of:

a) reading said first sequence and said reference sequence through use of a computer
program which compares sequences; and

b) determining differences between said first sequence and said reference sequence with
said computer program;

wherein said first sequence is the sequence of the polynucleotide according to any one
of the claims selected from the group consisting of 1, 3, 5, 7, 8, 9 and 14.
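
The comparison in claim 57 is not tied to any particular program. A minimal sketch of steps a) and b) is given below for the simplest case, in which the two sequences are already aligned and of equal length, so differences can be found position by position. The function name `sequence_differences` and the equal-length assumption are illustrative only; sequences of unequal length would require an alignment step first.

```python
from typing import List, Tuple


def sequence_differences(first: str, reference: str) -> List[Tuple[int, str, str]]:
    """Return (0-based position, reference base, first-sequence base) for each mismatch.

    Assumes both sequences are pre-aligned and have the same length.
    """
    if len(first) != len(reference):
        raise ValueError("sequences must have equal length for position-wise comparison")
    return [
        (i, ref_base, base)
        for i, (base, ref_base) in enumerate(zip(first.upper(), reference.upper()))
        if base != ref_base
    ]


if __name__ == "__main__":
    print(sequence_differences("ACGTA", "ACCTA"))
```

For a biallelic marker, a single reported mismatch at the marker position would correspond to the two sequences carrying different alleles.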

58. A diagnostic kit comprising a polynucleotide according to any one of claims 1, 3,
5, 7, 8, 9, 14, 18, and 19.

59. A method of identifying a gene associated with a detectable trait comprising the 
steps of: 

a) determining the frequency of each allele of at least one map-related biallelic marker
in individuals having said detectable trait and individuals lacking said detectable trait
according to the method of claim 41;

b) identifying at least one allele of said biallelic marker having a statistically 
significant association with said detectable trait; and 

c) identifying a gene in linkage disequilibrium with said allele. 
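
Step c) of claim 59 relies on linkage disequilibrium between the associated allele and a nearby locus. The sketch below computes three commonly used pairwise measures (D, the absolute value of D', and r squared) from one haplotype frequency and two allele frequencies. The claim does not specify any measure; the function name `linkage_disequilibrium` and the choice of statistics are assumptions of this example.

```python
from typing import Tuple


def linkage_disequilibrium(p_ab: float, p_a: float, p_b: float) -> Tuple[float, float, float]:
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2.
    p_a, p_b: frequencies of allele A and allele B in the same population.
    """
    for p in (p_ab, p_a, p_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("frequencies must lie between 0 and 1")
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r_squared = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r_squared


if __name__ == "__main__":
    print(linkage_disequilibrium(0.5, 0.5, 0.5))   # complete disequilibrium
    print(linkage_disequilibrium(0.25, 0.5, 0.5))  # no disequilibrium
```

Values of |D'| or r squared near 1 indicate that the two loci are almost always inherited together in the sampled population.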



60. The method according to claim 59, further comprising the step of: d) identifying a
mutation in said gene which is associated with said detectable trait.



61. A method of identifying biallelic markers associated with a detectable trait 
comprising the steps of: 

a) determining the frequencies of a set of biallelic markers comprising at least one
map-related biallelic marker in individuals who express said detectable trait and individuals
who do not express said detectable trait; and

b) identifying at least one biallelic marker in said set which is statistically associated
with the expression of said detectable trait.

62. A method for determining whether an individual is at risk of developing a
detectable trait or suffers from a detectable trait associated with said trait comprising the steps
of:

a) obtaining a nucleic acid sample from said individual;

b) screening said nucleic acid sample with at least one map-related biallelic marker;
and

c) determining whether said nucleic acid sample contains at least one biallelic marker
statistically associated with said detectable trait.



63. The method according to any one of claims 59, 61 and 62, wherein said detectable 
trait is selected from the group consisting of disease, drug response, drug efficacy, treatment 
response, treatment efficacy, and drug toxicity. 

64. A method of administering a drug or treatment comprising: 

a) obtaining a nucleic acid sample from an individual;

b) determining the identity of the polymorphic base of at least one map-related
biallelic marker according to the method of claim 31 which is associated with a positive
response to said drug or treatment, or at least one map-related biallelic marker which is
associated with a negative response to said drug or treatment; and

c) administering said drug or treatment to said individual if said nucleic acid sample
contains at least one biallelic marker associated with a positive response to said drug or
treatment, or if said nucleic acid sample lacks at least one biallelic marker associated with a
negative response to said drug or treatment.

65. A method of selecting an individual for inclusion in a clinical trial of a drug or
treatment comprising:

a) obtaining a nucleic acid sample from an individual;

b) determining the identity of the polymorphic base of at least one map-related
biallelic marker according to the method of claim 31 which is associated with a positive
response to said drug or treatment, or at least one biallelic marker associated with a negative
response to said drug or treatment in said nucleic acid sample; and

c) including said individual in said clinical trial if said nucleic acid sample contains at
least one biallelic marker which is associated with a positive response to said drug or
treatment, or if said nucleic acid sample lacks at least one biallelic marker associated with a
negative response to said drug or treatment.

66. The method according to either of claims 64 and 65, wherein said administering
step comprises administering said drug or treatment to said individual if said nucleic acid
sample contains at least one biallelic marker associated with a positive response to said drug
or treatment, and said nucleic acid sample lacks at least one biallelic marker associated with a
negative response to said drug or treatment.

67. The method according to any one of claims 59, 61, 62, 64, and 65, wherein said 
map-related biallelic marker is selected from the group consisting of the biallelic markers of 
SEQ ID Nos. 1 to 3908. 

68. The method according to any one of claims 59, 61, 62, 64, and 65, wherein said
map-related biallelic marker is selected from the group consisting of the biallelic markers of
SEQ ID Nos. 1 to 3734.